More and more, people are asking me “how do you compare Snowflake and Databricks?”
I have to admit, I’ve avoided the question a little. For one, it’s not an apples-to-apples comparison. Both support analytics, and they have increasingly been competing with each other, but they’re not quite the same thing. Snowflake is a data warehouse that now supports ELT. Databricks, which is built on Apache Spark, provides a data processing engine that many companies use with a data warehouse. Companies can also use Databricks as a data lakehouse through Databricks Delta Lake and Delta Engine.
But it’s a really important question, in part because many companies already use Databricks, Snowflake, or both. So here’s my best answer, which I’ve broken into two parts.
- First, how should you compare a data lakehouse and data warehouse, and when should you choose one or the other for a given use case?
- Second, how do Databricks and Snowflake compare today? I’ve included a detailed comparison across more than 30 categories spanning architecture, performance, scalability, use cases, and cost based on the criteria in the data warehouse comparison guide.
Data lakehouse vs data warehouse
First let’s define a data lake, data lakehouse, and data warehouse, and their purpose.
A data warehouse is the database of choice for general-purpose analytics, including reporting, dashboards, ad hoc queries, and any other high-performance analytics. It gets its performance in part by sorting, pre-computing, and compressing data during ingestion for certain types of operations, including joins and aggregations. In part because the focus is on performance, a data warehouse generally holds only structured and, more recently, semi-structured data. A modern cloud data warehouse supports “data-lake scale,” meaning you can hold massive data sets and mostly pay only cloud storage costs. You do, however, have to access the data through the data warehouse engine, which adds compute costs as well.
A data lake is (only) a data store for any raw structured, semi-structured, and unstructured data, one that makes data easily accessible to anyone. You can use it as a batch source for a data warehouse or any other workload. For example, it enables self-service ELT for any workload, so you can quickly add new data to a data warehouse.
A data lakehouse is often described as a new, open data management architecture that combines the best of a data lake with a data warehouse. The goal is to implement the best of a data lake and a data warehouse, and to reduce complexity by moving more analytics directly against the data lake, thereby eliminating the need for multiple query engines.
But getting to this goal is tricky. A data lakehouse is usually implemented as a combination of a data lake with a federated query engine. That poses several challenges today:
- It can limit access to the raw data to one query engine, which defeats the purpose of a data lake as a data store to help make raw data accessible to everyone.
- The data in a data lake is raw data. You still need to organize all of that data and make it easily discoverable.
- Data lakehouses do not yet have as broad a set of integrations with ETL/ELT tools or BI tools as data warehouses.
- Perhaps most importantly, data lakehouses do not have the performance of a data warehouse. Matching it requires recreating everything done in ETL/ELT and data warehouse ingestion, not to mention the other optimizations in the best data warehouse engines. That takes time.
Despite all these challenges, I do recommend you consider offering a federated SQL engine as a default service with your data lake. Not only does it make it easy for new groups without their own tools to access the raw data; if you put together the right team, you will also deliver better, faster SQL access and lower costs for the other teams. One great service will be less expensive than several mediocre sets of tools spread out across groups.
I just can’t recommend a single engine that combines a data lake, a data warehouse, and other specialized analytics needs, because such an engine does not exist today.
When should you choose a data warehouse or a data lakehouse? Eventually, you should end up with an architecture where your general-purpose analytics, especially your interactive analytics that need performance, are supported by a data warehouse. You should have all your raw data in a data lake.
A data lakehouse can support reporting and dashboards, so long as you don’t need your queries to run in a few seconds or less every time. It can also be your go-to place for one-off analytics to answer specific questions with new data, or for prototyping.
There is a framework you can use to decide as well. Just answer these three questions.
- How fast do the queries need to run?
- Do you know the data you need to use in advance?
- Do you know the analytics you need to run in advance?
If you want all the details, I answered them in another blog on how to choose.
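The three questions above can be sketched as a small decision function. The logic below is my hypothetical reading of the framework, not a definitive rule; the function name and boolean inputs are illustrative only.

```python
def choose_engine(needs_interactive_speed: bool,
                  data_known_in_advance: bool,
                  analytics_known_in_advance: bool) -> str:
    """Suggest an engine type for a workload based on the three questions.

    Illustrative decision logic only; adjust to your own requirements.
    """
    if needs_interactive_speed:
        # Queries that must run in a few seconds or less need storage
        # optimized during ingestion, i.e. a data warehouse.
        return "data warehouse"
    if data_known_in_advance and analytics_known_in_advance:
        # Predictable, repeated workloads also fit a warehouse well.
        return "data warehouse"
    # Exploratory, one-off work over raw data fits a lakehouse.
    return "data lakehouse"

print(choose_engine(True, False, False))   # interactive dashboards
print(choose_engine(False, False, False))  # one-off exploration of new data
```

The point is less the exact thresholds than the shape of the decision: performance requirements and predictability push you toward a warehouse; raw, unplanned exploration pushes you toward a lakehouse.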
You can also think about when you add each new technology over time as you modernize your data pipeline and overall data infrastructure.
- Add a data lake the moment you need to store raw data, and do not know how you will need to use it, or you need to improve how you perform ELT into your data warehouse.
- Add a data lakehouse or federated query engine to a data lake the moment anyone needs SQL access to raw data for any workload including one-off reports or prototyping.
- For general-purpose reporting and dashboards, use a data warehouse.
- Add even faster engines for interactive internal or external analytics and data applications as you need them.
A data lakehouse cannot be as fast as a high-performance data warehouse today because it does not structure the data in advance to improve query performance the same way.
Be ready to choose several engines. Do not resist it. With a modern data pipeline, you can choose the right engine for each workload when you need them because the value of each engine easily outweighs the cost. New deployments are much faster and lower cost with modern data pipelines, and then you pay as you go for compute. Even if you have multiple copies, storage is relatively cheap across all solutions.
As an example, all of our customers run Firebolt side-by-side with Snowflake, BigQuery, Redshift, Athena, or Presto. They added Firebolt to improve performance and efficiency at scale for the fast analytics that weren’t working well. The rest was working well, so why replace it?
Databricks vs Snowflake vs Firebolt
Now for the second part: how do Databricks and Snowflake compare as vendors?
Many companies use them together; Databricks for data processing (ETL), and Snowflake as their data warehouse. But Databricks and Snowflake have been steadily moving into each other’s core markets - ETL and data processing, and data warehousing/lakehousing - for some time as they both try to become a data platform of choice for multiple workloads.
If you compare Snowflake vs Databricks Delta Lake and Delta Engine today, at first glance they may seem pretty similar (see below). For example, both can support reporting and dashboard workloads in the cloud.
Snowflake has marketed themselves as a lakehouse for several years. Decoupled storage and compute means you can store any amount of data in Snowflake at cloud storage costs. Snowflake has also been steadily expanding their ELT capabilities. Many companies run their ELT with Snowflake. For example, some use Airflow to orchestrate data movement outside of Snowflake, and dbt with Snowflake, or Snowflake Tasks and Streams to orchestrate SQL tasks in Snowflake.
Databricks became a data lake vendor more recently. They first added Delta Lake, a data lake built on their own versioned Parquet. Delta Lake requires integration to make it work with other engines; while it doesn’t yet have broad ecosystem support, other engines can directly access the data, and the list of integrations has been growing. Then they added Delta Engine, a federated query engine that combines Databricks Photon, a specialized engine for SparkSQL, with a caching layer. They have also been steadily expanding their (Spark) SQL support so that you can do more of your ELT/ETL using SQL.
For data processing, if you need to go beyond SQL, Apache Spark or Databricks are great options. But for analytics use cases, there are really two major differences that lead you to choose one vendor over the other.
The first big difference is the types of data that can be stored and processed. Databricks Delta Lake is a data lake that can store raw unstructured, semi-structured, and structured data. When combined with Delta Engine it becomes a data lakehouse.
Snowflake supports semi-structured data and is starting to add support for unstructured data as well. But today, if you really need a data lake for all your raw data, an actual data lake is the better option. Evaluate Delta Lake and Delta Engine against, for example, a data lake with Presto.
The second big difference is performance. If you need query performance for interactive queries, you will probably end up choosing a data warehouse like Snowflake, BigQuery, Redshift, or Firebolt.
Delta Lake has Delta Engine as a query engine optimized to run SparkSQL. In practice, Databricks has shown 2-4x acceleration of SparkSQL for deployments, and claims up to 60x performance improvements for specific queries.
This should not be too surprising. Delta Engine consists of a C++-based vectorized SQL query optimization and execution engine (Photon) plus caching on top of Delta Lake’s versioned Parquet. Databricks and the broader Spark community know best how to optimize SparkSQL. They can optimize for Apache Arrow or another internal format to avoid the cost of serialization and deserialization. Delta Engine also adds hash aggregations to accelerate GROUP BY and other aggregation operators. The Databricks version of Delta Engine brings more advanced features like caching and data co-location (Z-Ordering). Delta Lake also has other data warehouse-like features, such as vacuuming, that help with performance.
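To see why hash aggregation accelerates GROUP BY, here is a toy single-pass version in Python. This is purely an illustration of the general strategy, not Delta Engine’s actual implementation; the function and column names are made up.

```python
from collections import defaultdict

def hash_group_by(rows, key, agg_col):
    """Toy single-pass hash aggregation:
    SELECT key, SUM(agg_col) FROM rows GROUP BY key.
    """
    acc = defaultdict(float)
    for row in rows:
        # One hash probe per row; no sort of the input is needed.
        acc[row[key]] += row[agg_col]
    return dict(acc)

rows = [
    {"region": "EU", "revenue": 10.0},
    {"region": "US", "revenue": 25.0},
    {"region": "EU", "revenue": 5.0},
]
print(hash_group_by(rows, "region", "revenue"))  # {'EU': 15.0, 'US': 25.0}
```

The win over sort-based aggregation is that each row is touched once with O(1) hash-table updates, instead of paying for a full sort before grouping.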
But even with this performance acceleration, Delta Engine on Delta Lake may struggle to match the performance of a data warehouse for more interactive query workloads, where caching doesn’t help as much, because a data warehouse optimizes storage for data access by the query engine during ingestion.
For example, Snowflake has cluster keys to sort data and micro-partition pruning to reduce the number of partitions accessed and stored in the Snowflake virtual warehouse (compute) cache. Snowflake also adds search optimization, which improves the performance of returning small subsets of rows. Cluster keys, pruning, and search optimization all shrink the amount of data fetched and scanned compared to a federated query engine like Delta Engine that does not have these features. Given that Snowflake also has an optimized, and vectorized query engine, you should expect Snowflake to be faster for any ad hoc or interactive queries that need to fetch data from remote storage.
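The idea behind partition pruning can be shown with a few lines of Python. Each partition keeps min/max metadata for a column, and a range filter skips every partition whose range cannot match. This is a conceptual sketch of the technique, not Snowflake’s micro-partition format; all names and layouts here are invented.

```python
# Each partition stores min/max metadata for a column alongside its rows.
partitions = [
    {"min": 1,   "max": 100, "rows": list(range(1, 101))},
    {"min": 101, "max": 200, "rows": list(range(101, 201))},
    {"min": 201, "max": 300, "rows": list(range(201, 301))},
]

def pruned_scan(partitions, lo, hi):
    """Skip partitions whose [min, max] range cannot overlap [lo, hi],
    then scan only the surviving partitions."""
    touched = [p for p in partitions if p["max"] >= lo and p["min"] <= hi]
    rows = [r for p in touched for r in p["rows"] if lo <= r <= hi]
    return len(touched), rows

touched, rows = pruned_scan(partitions, 150, 160)
print(touched)    # only 1 of 3 partitions is read
print(len(rows))  # 11 matching rows
```

The filter on the metadata decides which partitions to fetch before any row is read, which is exactly why pruning shrinks both data fetched from remote storage and data held in the compute cache.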
Firebolt goes even further on indexing. Sparse (primary) indexes sort data based on multiple columns in the order they’re listed in the index definition. This shrinks data access down to exact data ranges that are much smaller than partitions. Aggregating indexes precompute aggregations during ingestion and replace compute and data access with cached results, without requiring you to rewrite your queries. Join indexes do the same for joins.
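An aggregating index can be pictured as a running aggregate maintained at ingestion time, so a matching GROUP BY query becomes a lookup instead of a scan. This is a toy illustration of the concept only, not Firebolt’s storage format.

```python
# Raw fact rows as they arrive: (region, revenue).
raw_rows = [("EU", 10), ("US", 25), ("EU", 5), ("US", 15)]

# The "aggregating index" is maintained incrementally during ingestion.
agg_index = {}
for region, revenue in raw_rows:
    agg_index[region] = agg_index.get(region, 0) + revenue

# At query time, SELECT region, SUM(revenue) ... GROUP BY region
# is answered from the precomputed result, with no scan of raw_rows.
print(agg_index)  # {'EU': 15, 'US': 40}
```

The cost of the aggregation is paid once per ingested row instead of once per query, which is why precomputed results can replace both compute and data access for repeated queries.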
The result is 10x or less data accessed and stored in compute. That translates into 10x or faster data access and query execution, since 10x or less data is scanned and processed, even before you add in the performance gains from cached, precomputed results. This is how customers have been able to achieve 10x or faster performance and 10x lower cost in their own benchmarks.
The detailed comparison - Databricks vs Snowflake vs Firebolt
Now for the detailed comparison across more than 30 categories of Databricks vs Snowflake vs Firebolt, spanning:
- Use cases
- Architecture
- Scalability
- Performance
- Cost
I’ve included Firebolt because, well, I work for Firebolt. But these three do cover the full range of analytics and data applications across analytics, employees, and customers, so it should help you understand the key criteria for each type of analytics and data apps as well.
*While Google BigQuery publishes official quotas and other limits, you can in fact ask for them to be raised when you have reserved slots. None appear to be hard technical limits.
In short, Databricks Delta Lake and Delta Engine together form a lakehouse. You choose them as a data lake, and for data lakehouse-based workloads including ELT for data warehouses, data science and machine learning, and even static reporting and dashboards if you don’t mind the performance difference and don’t have a data warehouse.
Most companies still choose a data warehouse like Snowflake, BigQuery, Redshift, or Firebolt over a data lakehouse like Delta Lake and Delta Engine for general-purpose analytics because they need the performance.
But it doesn’t matter. You need more than one engine. Don’t fight it. You will end up with multiple engines for very good reasons. It’s just a matter of when.
Eventually one vendor may combine all these capabilities into a single offering and make it much easier for more companies to build and manage data pipelines. But for now, choose several to get the best of lakehouses and data warehouses with your data pipeline.