Where there is data there is analytics. Over the last decade, as cloud computing has taken hold, so have cloud data warehouses. Nearly a decade ago, Snowflake released one of the first modern cloud data warehouses to enter the market, with elastic scalability built on the separation of storage and compute. Since then, Snowflake has become one of the leading cloud data warehouses in market share along with Redshift, BigQuery, and Azure Synapse.
But as cloud data warehouse usage has grown, so have the challenges. Some challenges were the same ones on-premises data warehouses had for decades, such as the lack of query performance needed for ad hoc analytics, or the ever-increasing volumes of batch data. Other challenges are newer, such as the need to support ever-increasing volumes of streaming and semi-structured data, or hundreds to thousands of concurrent users.
In short, what companies need out of a cloud data warehouse has changed. Snowflake is great for moving historical reporting and business intelligence (BI) workloads to the cloud. But it is not well suited for ad hoc analytics. Even at its fastest, Snowflake is still slower than several of the on-premises data warehouses it has replaced. It is also not as well suited for semi-structured or streaming data, or for operational or customer-facing analytics. Snowflake can also get very expensive, very fast. You may already have heard stories about “credit fever”, or single one-off queries that blew the credit budget.
There have been several technical innovations added to cloud data warehouses to help deliver more performance and scalability out of existing infrastructure, and lower costs. Some came straight from legacy data warehouses. Others are completely new, similar to the types of innovations that happened with Hadoop for batch computing.
Firebolt is different. It has added many of these new innovations to improve performance, scale and efficiency. With Firebolt, companies have achieved sub-second performance at gigabyte to petabyte scale, and in their own benchmarks have measured individual queries running from 4x to 6,000x faster than on Snowflake.
This whitepaper provides a detailed comparison between Firebolt and Snowflake across more than 30 different categories including their overall architectures and core features, scalability, performance, and cost, as well as their suitability across different analytics use cases. It begins with the summary comparison as a table, and then proceeds to explain the differences in each major category in more detail.
Snowflake was one of the first modern cloud data warehouses to enter the market nearly a decade ago. Since then, it has become one of the leading cloud data warehouses in market share. It was a 2nd-generation data warehouse that combined the separation of storage and compute for elastic scalability with the simplicity of SaaS, eliminating tuning and other administrative tasks. Originally released in 2012 on AWS as a shared multi-tenant service, Snowflake can now be deployed on AWS, Azure (2018) and Google Cloud (2020) and also as a Virtual Private Snowflake (VPS) in its own isolated tenant per customer.
Firebolt is much newer to market. After several years of development, Firebolt came out of stealth in 2020 as a cloud-native service on AWS. It is very similar to Snowflake in that it is built on decoupled storage and compute. But it is also the first cloud data warehouse to focus on improving performance, scalability and cost for newer analytics workloads including high performance, interactive ad hoc, semi-structured data, and operational and customer-facing analytics and data applications. By combining more recent innovations in storage, indexing, query optimization, and query execution, Firebolt has been able to deliver an order of magnitude improvement in performance, with sub-second performance from gigabyte to petabyte scale. It has also cut costs by an order of magnitude by improving the efficiency of compute and storage, and by allowing customers to choose the right size and number of AWS instance types for each workload.
This whitepaper starts with the conclusion: a detailed comparison of Firebolt and Snowflake in a single table across more than 30 categories, including:
What follows the table is a more detailed explanation for each major category.
Snowflake has three major layers in its architecture: storage, compute and cloud services. The storage layer contains all data and is encrypted with separate keys for each customer. Snowflake is by default - for standard, enterprise and business critical editions - multi-tenant. While customer data is protected with individual customer keys and encrypted at rest, it is in a shared tenancy. Only Virtual Private Snowflake (VPS) offers isolated tenancy per customer in an isolated Snowflake account.
The unit of computing is a virtual warehouse (also called a warehouse), a cluster of compute nodes dedicated to a specific customer. At any time a customer can provision a new virtual warehouse with 1, 2, 4, 8, 16, 32, 64 or 128 nodes where each larger warehouse also has larger individually-sized nodes. Snowflake does not disclose the node details such as the instance type, CPU, RAM, SSD or disk. When you provision a new warehouse, users are assigned to it, and any data needed by those users is loaded as needed. The cloud services are for managing users, access, security and other aspects of Snowflake.
Snowflake has several methods of ingesting data, including Snowpipe for batch ingestion and direct writes. But Snowflake is more batch-centric. Snowpipe does not recommend batch intervals more frequent than 1 minute, and batches up streams internally for (micro) batch-based ingestion. In addition, each table can only have up to 20 queued DML (write) statements, and micro-partitions, the standard block of storage, use an immutable columnar format, so a micro-partition must be rewritten with each individual write. These limitations mean Snowflake is less suitable for continuous ingestion.
Snowflake’s security is effectively on par with the security of other offerings. Most encrypt data at rest, secure network connections, provide firewall protection that includes whitelist/blacklist level control, and provide role based access control (RBAC).
Firebolt is also a multi-tenant SaaS cloud data warehouse. It stores each customer’s data in isolated S3 storage and provisions new compute resources as needed in its own account. It also has multi-tenant services for administration, storing metadata about deployments, and security, all similar to Snowflake’s.
Customers can start up dedicated compute clusters, called engines, by choosing a number of nodes, from 1 to 128 nodes, and just about any type of node. Administrators can then assign any combination of users and specific databases to multiple engines, and run different workloads on them. For example, you can have one cluster doing ingestion/ETL for a few databases, and another cluster that queries the same databases. As with Snowflake, you can resize an engine at any time. There is no limit to the number of engines you can run, and no physical limit to database size.
Similar to Snowflake, Firebolt will pull data into the local engine as needed. Unlike Snowflake’s storage format, the Firebolt File Format (F3), pronounced “TripleF”, is optimized to improve both ingestion and query performance. F3 as a data access layer manages data across S3 storage, engines and RAM. During ingestion, any number of Firebolt engines can perform batch and continuous ingestion at any scale. Engine nodes ingest data in parallel, with formats ranging from Parquet, ORC, or AVRO to JSON. Firebolt has implemented a multi-master write architecture, which means any node can perform any write. The writes are non-locking, meaning the writes are immediately made as new segments. As the data is ingested, it is segmented, optimized, and written to F3 storage using operating-system-level drivers designed for sparse storage. The moment data is ingested into cache it is visible in queries, and a query always returns the latest data. In addition, each incremental write is done without having to immediately rewrite an existing segment.
Firebolt also manages and stores indexes with the data. This includes sparse (primary) indexes for accessing data, aggregating indexes to accelerate dimensional analytics, and join indexes, which replace full scans with faster operations such as lookups. Throughout ingestion, all indexes are continuously updated with each write. As with data, indexes are fetched from storage and cached locally for performance.
Overall Snowflake provides solid elastic scalability, with a few exceptions. The unit of compute scalability with Snowflake is a warehouse. The only way to scale for query size or complexity is to scale up a warehouse, where each larger size increases the size of each node instance incrementally and doubles the number of nodes. This can make scaling to handle increased query size and complexity very expensive (more in the cost section.) When you increase or decrease the size of your warehouse, Snowflake provisions the new warehouse (usually within minutes) and moves users to it. Any data needed is loaded as needed from storage.
The way to scale users is to add more warehouses. You have a choice to manually assign different users to different warehouses (of different sizes), or to use the multi-cluster option available with enterprise and higher editions. Multi-cluster lets you choose a minimum and maximum number of warehouses from 1-10. You can configure Snowflake to automatically add another warehouse of the same size if it detects any queuing of queries (standard), or queuing greater than 6 minutes (economy.) Snowflake will also automatically load balance, start up, and shut down additional warehouses when these conditions are no longer met.
While this approach automates scaling, you can suddenly find yourself paying for an entire new warehouse to support a single extra user, which can be very expensive if performance requires large warehouse sizes.
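The multi-cluster behavior described above can be sketched in a few lines of Python. This is a simplified illustration of the two policies as this paper describes them, not Snowflake's actual implementation: the function names, the one-step-at-a-time scaling, and the flat hourly cost model are all assumptions made for the example.

```python
def next_cluster_count(current, min_clusters, max_clusters, queued_minutes, policy):
    """Decide how many same-size warehouses to run after one evaluation step.

    Hypothetical model: 'standard' adds a warehouse on any queuing,
    'economy' only when queuing exceeds 6 minutes. When the condition
    is not met, one warehouse is shut down (never below the minimum).
    """
    if policy == "standard":
        scale_out = queued_minutes > 0
    elif policy == "economy":
        scale_out = queued_minutes > 6
    else:
        raise ValueError(f"unknown policy: {policy}")
    wanted = current + 1 if scale_out else current - 1
    return max(min_clusters, min(max_clusters, wanted))


def hourly_cost(clusters, nodes_per_cluster, price_per_credit):
    """One credit is one node compute-hour, so every added warehouse adds its full size."""
    return clusters * nodes_per_cluster * price_per_credit


# Thirty seconds of queuing triggers a second warehouse under 'standard' only.
standard = next_cluster_count(1, 1, 10, 0.5, "standard")
economy = next_cluster_count(1, 1, 10, 0.5, "economy")
print(standard, hourly_cost(standard, 64, 3.0))
print(economy, hourly_cost(economy, 64, 3.0))
```

With a 64-node enterprise warehouse in this model, brief queuing under the standard policy doubles the hourly compute bill, which is the cost jump the paragraph above warns about.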
In addition, Snowflake is better suited for batch-based ingestion. Its current limitations of 1 minute minimum intervals for Snowpipe, 20 queued writes per table, and micro-partition level locking (because the entire micro-partition needs to be rewritten with each write), limit continuous ingestion scalability.
Firebolt has a similar architecture in that you can assign any users to any engines, which are the equivalent of warehouses in Snowflake. It not only scales for batch, but supports continuous ingestion as well (see the architecture section). Firebolt does not currently provide auto-scaling, though it is planned. Today, starting resources can be mostly automated through scripts, and engines can be configured to automatically stop unused resources. As with Snowflake, there is no limit in user scaling; you can provision any number of engines of any node size up to 128 nodes. Where Firebolt shines is in more efficient scaling from features that include the ability to choose any node size, continuous ingestion, sparse indexing and F3 storage that lead to more efficient data access, and semi-structured data support.
While Snowflake was a true innovator in providing elastic scale, it did not significantly improve performance. In fact, Snowflake can be slower than several of the legacy on-premises data warehouses it has been replacing over time. This is in part because Snowflake has focused on simplicity of administration, and not focused on performance optimization or tuning options.
With Snowflake you do not have a lot of options to improve performance. You do not know and cannot choose the size of individual nodes, for example. You can either increase the size of your warehouse, add materialized views, leverage cluster keys, or use the more recently added search optimization.
Improving query performance for complex queries, large data sets and semi-structured data is a major challenge. They all require nodes that can hold all the required data in RAM to deliver fast performance. Otherwise, as a node runs out of RAM it will start to spill data from RAM to disk (as virtual memory), and this paging will dramatically slow down query performance. While you cannot see the exact instance size, each larger warehouse size provides a slightly larger node size as well. So you can increase the warehouse size until you have large enough nodes. But that means doubling the cost of the warehouse and the number of nodes with each increase until you reach a large enough node that can hold the data in RAM.
You also need to understand the difference between first-time query and repetitive query performance. Snowflake is used primarily to support more traditional reporting and dashboard-based applications in the cloud. It has a tiered caching architecture that performs well when the same queries are performed many times. But the architecture does not support ad hoc well because a first-time query will easily take tens of seconds to minutes, and many ad hoc queries are first-time queries.
The first time data is needed, it is transferred from remote storage into the virtual warehouse's local cache storage, such as SSD. A sizable query, according to Fivetran and other benchmarks, can easily take tens of seconds to minutes. The query result is then stored in the result cache. Once all of the data is stored in the local warehouse cache, queries can run 10x faster, with second-level performance. If the exact query result already exists in the result cache, the query can easily return the result 10x faster than the local disk cache, with sub-second performance. But it requires the query to be the same and the original data to be unchanged.
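The three tiers can be illustrated with a toy Python model. It is only a sketch of the lookup order and the invalidation rule described above; the class name, method names, and in-memory dictionaries are hypothetical stand-ins, and no real timings are modeled.

```python
class TieredQueryCache:
    """Toy model of the three tiers: result cache, local cache, remote storage."""

    def __init__(self, remote_storage):
        self.remote_storage = remote_storage  # dict: table name -> list of rows
        self.local_cache = {}                 # table name -> rows held locally (e.g. SSD)
        self.result_cache = {}                # query text -> (table name, result)

    def run(self, query, table, compute):
        """Return (result, tier), where tier is the slowest tier the query touched."""
        if query in self.result_cache:
            return self.result_cache[query][1], "result_cache"
        if table in self.local_cache:
            tier = "local_cache"
        else:
            # First touch: copy the data from remote storage into the local cache.
            self.local_cache[table] = list(self.remote_storage[table])
            tier = "remote_storage"
        result = compute(self.local_cache[table])
        self.result_cache[query] = (table, result)
        return result, tier

    def write(self, table, rows):
        """Changing the data drops cached copies and cached results for that table."""
        self.remote_storage[table] = list(rows)
        self.local_cache.pop(table, None)
        self.result_cache = {
            q: v for q, v in self.result_cache.items() if v[0] != table
        }


store = TieredQueryCache({"sales": [10, 20, 30]})
first = store.run("SELECT SUM(x) FROM sales", "sales", sum)    # cold read
other = store.run("SELECT MAX(x) FROM sales", "sales", max)    # new query, data now local
repeat = store.run("SELECT SUM(x) FROM sales", "sales", sum)   # identical query
store.write("sales", [10, 20, 30, 40])
after_write = store.run("SELECT SUM(x) FROM sales", "sales", sum)
print(first, other, repeat, after_write)
```

In this model an ad hoc workload behaves like `first` most of the time, because each new query over uncached data goes all the way to remote storage.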
The other way to improve performance is by using materialized views, which provide an up-to-date result of a query. This can significantly improve the performance of complex queries, including queries against semi-structured data. But given the additional costs in compute and storage, it is only useful for repeated queries against slowly changing data. They are also limited in the SQL they support, and can only be used with a single table. They do not support joins.
Snowflake has added some other performance optimizations. Some, such as query vectorization, are becoming more common. A slightly more unique optimization in Snowflake is the combination of micro-partitions and cluster keys. Micro-partitions are Snowflake’s contiguous units of columnar storage. They vary in size from 50-500MB, in part to support updates. Whenever an update happens, the entire partition must be re-written because a micro-partition is immutable. But this is done transparently. Data ranges are maintained for each micro-partition to help with pruning of micro-partitions during queries to improve performance. In addition to manual sorting, you can also choose a subset of columns as a cluster key. Snowflake will automatically cluster data within micro-partitions based on the composite key. Cluster keys help improve columnar compression by clustering similar values together, in addition to improving query performance through the pruning of unneeded ranges of micro-partitions.
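Range-based pruning of this kind can be sketched as follows. The example assumes a single integer column and a hypothetical `MicroPartition` record that holds only the minimum and maximum value per partition; it illustrates the general min/max pruning technique rather than Snowflake's internal metadata format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MicroPartition:
    name: str
    min_value: int
    max_value: int


def prune(partitions, lo, hi):
    """Keep only partitions whose [min, max] range overlaps the filter [lo, hi]."""
    return [p for p in partitions if p.max_value >= lo and p.min_value <= hi]


# Clustered on the filtered column: value ranges barely overlap.
clustered = [
    MicroPartition("p0", 1, 100),
    MicroPartition("p1", 101, 200),
    MicroPartition("p2", 201, 300),
]
# Not clustered: every partition spans almost the whole value range.
unclustered = [
    MicroPartition("p0", 1, 290),
    MicroPartition("p1", 5, 300),
    MicroPartition("p2", 2, 280),
]

scanned_clustered = [p.name for p in prune(clustered, 150, 160)]
scanned_unclustered = [p.name for p in prune(unclustered, 150, 160)]
print(scanned_clustered)
print(scanned_unclustered)
```

The same filter scans one partition when the data is clustered on the column and all three when it is not, which is why the choice of cluster key matters for pruning.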
More recently Snowflake added search optimization, which helps improve the performance of filter and other operations that return small result sets for large (300GB+) tables with a large number of distinct values. It works well even when data is not clustered on the related columns. But it only works well for high cardinality columns and small result sets.
Firebolt’s biggest innovations are in performance and price-performance. With elastic scalability, where doubling the resources for longer queries can cut query times in half, it is important to measure price-performance (performance gain × price advantage). In benchmarks, Firebolt has delivered 10x or faster performance compared to the alternatives. Compared to Snowflake, customer benchmarks have shown performance gains ranging from 4x to 6,000x across queries.
The core of Firebolt’s innovation is the combination of its storage, indexing and query engine, optimized together for performance as well as elastic scalability. Its query engine is written in C++ for fast (sub-second) and predictable (low standard deviation) query performance.
For storage optimization, F3 provides a unified data access layer that transparently manages data caching and access across the tiered data layers from local cache to decoupled data storage. The query optimizer leverages sparse indexing and any join or aggregating indexes to identify the location of data and compiles query plans just in time (JIT) using cost-based optimization. This includes reordering query plans based on getting results from indexes or caches instead of remote storage, and performing pushdown optimization. Other optimizations include vectorized processing, data collocation, and native semi-structured data support.
Firebolt also provides indexing. While Snowflake does keep track of data ranges in micro-partitions (and cluster keys) to help prune micro-partitions out as a way to improve performance, it does not provide any indexing within a micro-partition. It relies on columnar storage to make full scans faster. Firebolt provides sparse indexing for much more granular data pruning, aggregating indexes for dimensional analytics, search indexes, and join indexes that replace full scans with lookups and other operations. Snowflake has no equivalent since its materialized views do not support joins or as broad a range of aggregation functions.
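To show why a sparse index prunes more finely than per-partition ranges, here is a minimal Python sketch of the general technique. It assumes rows sorted by a unique key and a fixed block size; the function names are hypothetical and this is not a description of Firebolt's actual index format.

```python
from bisect import bisect_right


def build_sparse_index(sorted_keys, block_size):
    """Store only the first key of every block of rows, not one entry per row."""
    return [sorted_keys[i] for i in range(0, len(sorted_keys), block_size)]


def candidate_rows(sparse_index, block_size, key, n_rows):
    """Return the (start, end) row range of the one block that could hold `key`."""
    block = bisect_right(sparse_index, key) - 1
    if block < 0:
        return (0, 0)  # key is smaller than every indexed key
    start = block * block_size
    return (start, min(start + block_size, n_rows))


keys = list(range(0, 1000, 2))           # 500 sorted, unique keys
index = build_sparse_index(keys, 100)    # 5 index entries for 500 rows
start, end = candidate_rows(index, 100, 642, len(keys))
print(index, (start, end), 642 in keys[start:end])
```

A lookup reads the small index, then scans a single block of 100 rows instead of all 500. The index stays small enough to keep in memory even when the table is large.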
Firebolt also allows you to choose any node type, and any number of nodes. It means you can achieve the best balance of CPU, RAM, SSD, and scale-out for the best combination of price-performance. For example, you can choose a small number of very large nodes to support complex queries, versus having to go up to a 128 node Snowflake cluster just to get the largest node size.
Another key advantage is Firebolt’s native support for JSON. Snowflake stores JSON as raw text in a VARIANT column. While it creates and stores metadata to help process the JSON, and also sometimes creates what are basically internal columns to accelerate JSON operations, Snowflake often still has to load JSON into RAM first, and then perform full scans for processing. This can end up being very expensive, because you need to choose large cluster sizes to get large enough nodes that can fit all the JSON in RAM, and slow because you need to perform full scans. Firebolt allows any combination of flattening JSON or storing it natively as a nested array structure. You can UNNEST any data, and load JSON natively entirely within SQL using a few commands. The preserved JSON can then be queried within SQL using native Lambda functions. When nested storage is used, not only is the data stored efficiently, with individual elements compressed, but operations can also be performed without full scans.
Firebolt has also added capabilities that make it much better suited for continuous ingestion-based workloads compared to Snowflake. Multi-master lock-free ingestion, combined with the ability to see data in cache the moment it is ingested means Firebolt can support low latency use cases better than Snowflake.
Snowflake simplifies administration, but the cost of inefficient scaling and performance, together with its pricing, can outweigh the benefits if you are willing to manage and tune clusters.
Snowflake offers 4 editions: standard, enterprise, business critical and Virtual Private Snowflake (VPS) with $2, $3, $4 and undisclosed (higher) pricing respectively per credit.
You can think of Snowflake pricing as a markup on compute. A credit is a node compute-hour. A business critical 128 node warehouse will cost you $512 an hour, or $4.49 million list price for 8,760 hours a year in compute. Storage does not really have any markup compared to AWS S3 prices, and is cheap at $23 per TB per month for prepaid or $40 per TB per month for on-demand storage. But you do get charged for other data needs such as staging data. A petabyte of prepaid storage costs $276K annually at list price.
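The arithmetic behind these list prices can be reproduced with a short Python sketch. The per-credit prices and the $23 prepaid storage rate come from the text above; treating the storage rate as a per-terabyte price and a petabyte as 1,000 TB are assumptions, chosen because they reproduce the $276K annual figure. The function names are illustrative only.

```python
# List prices per credit quoted in the text (USD). VPS pricing is undisclosed.
PRICE_PER_CREDIT = {"standard": 2.0, "enterprise": 3.0, "business_critical": 4.0}
HOURS_PER_YEAR = 8_760


def hourly_compute_cost(nodes, edition):
    """One credit is one node compute-hour, so cost per hour = nodes * price per credit."""
    return nodes * PRICE_PER_CREDIT[edition]


def annual_compute_cost(nodes, edition, hours=HOURS_PER_YEAR):
    """Cost of keeping the warehouse running for `hours` hours."""
    return hourly_compute_cost(nodes, edition) * hours


def annual_storage_cost(terabytes, price_per_tb_month=23.0):
    """Assumes the quoted monthly storage price applies per terabyte."""
    return terabytes * price_per_tb_month * 12


print(hourly_compute_cost(128, "business_critical"))   # 512.0
print(annual_compute_cost(128, "business_critical"))   # 4485120.0
print(annual_storage_cost(1_000))                      # 276000.0
```

Each step up in warehouse size doubles the `nodes` argument, so the hourly and annual compute figures double with it.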
Snowflake does charge for other services in addition to a virtual warehouse. Snowpipe, which is used for batch data ingestion, does not require a warehouse to be running but does charge for the compute resources and a fixed price per file. Database replication costs are similar: you are charged for compute resources per second along with storage and data transfer. So are materialized view costs.
The biggest challenge with Snowflake is that your main option for improving performance (see the Performance section) is scaling up your warehouse. A query can only run within a single warehouse, so the only way to partition the work is to grow the warehouse size, which doubles both the warehouse size and the cost each size up. If you need to improve individual node performance, you have to keep scaling your warehouse to incrementally grow the node size until your node is big enough. Otherwise RAM will start to spill over to disk, and performance will start to drop fast.
Firebolt has repeatedly delivered 10x or greater price-performance through a combination of greater compute efficiency per node, choice of resources for each workload, and ability to optimize those resources. Firebolt also puts a focus on simplifying administration by automating several of the more complicated tasks. For example, you can simply resize an engine with a few clicks, and Firebolt handles all the reprovisioning. It automatically partitions data across differently sized nodes, and configures and updates indexes. It also automatically rebalances data in F3 over time.
Firebolt does expose more tuning options, which does require someone to spend time selecting different instance types, or configuring indexing in the administration console. But this is comparable to the time administrators might spend in Snowflake trying to improve performance without having the control to do so. Snowflake administrators often analyze performance using the Query Profile to look at the time spent on operations before they decide to rewrite a query, or to increase and decrease the size of a cluster based on disk spillage as an indicator of available RAM in the cluster (since they do not know the actual size of the instance type.)
The big differences in cost are in the total cost of compute. In terms of compute costs, Firebolt is 10x more efficient per node. It is much more efficient with CPU and RAM by leveraging query optimizations including indexing and native semi-structured data storage. It is more efficient with SSD by only storing the needed data instead of fetching and caching entire micro-partitions. It also allows you to choose the best instances and combination of CPU, RAM, and SSD to optimize price-performance for each type of ingestion and query workload. You can choose to have a small cluster with massive nodes, or a massive cluster with small nodes depending on the type of computing. Both can result in lower costs for specific workloads than the standard virtual warehouse sizes with Snowflake.
In one customer benchmark, a Firebolt client compared Firebolt vs Snowflake across 5 real-world analytical queries over a 0.5 TB data set:
Disclaimer - This is not a global benchmark. The results are based on real-world queries and run-times as reported by our users on Snowflake, and their equivalent run-times over the same data in Firebolt after tuning and optimization.
The combination of 10x greater efficiency per node, and choice of instance types and number for each engine has enabled companies using Firebolt to deliver the same computing at 10x lower costs than with Snowflake. Users can run more queries, and get more value out of the data as a result because they no longer limit their usage for fear of blowing the budget.
Over the last decade, not only has the volume, variety and velocity of data changed, so has the use of analytics and data. Until recently, the most common use of data warehouses involved analysts and managers using reports and dashboards to analyze historical data. The data was typically extracted, transformed, and loaded (ETL) from applications into data warehouses.
Two big trends have completely changed analytics. The first was a shift away from centralized decision making by analysts and managers to real-time decisions by employees and customers, which requires self-service analytics that can analyze historical and near real-time information.
The second change was the explosive growth of Big Data, in part to support more real-time decisions. In 1992, Walmart was the first to reach a 1 terabyte (TB) data warehouse. Ten years ago, some data warehouses reached 1 petabyte (PB). This growth has been driven by the explosive growth of newer types of data - including streams of data about connected customers, devices and applications.
Today most companies have the following types of analytics
There could be a host of different operational analytics systems from IT monitoring and network telemetry, to customer-facing analytics that a company sells to their end customers about their products, anything from financial assets, advertisements, automobiles to sports or games. Some operational analytics are even delivered as a (customer-facing) service by SaaS vendors.
Snowflake is by design a data warehouse as a service. It was created nearly a decade ago to help companies move existing traditional data warehouse workloads into the cloud. In short, that means Snowflake supports BI reporting and daily dashboard use cases really well. It has enabled companies to move these traditional analytics workloads into the cloud, and outsource their data warehouse infrastructure and infrastructure management.
But Snowflake is not as good for:
It can also be a costly solution to use for high concurrency (user and query) workloads given its costs as it scales (see the cost section.)
This means that while Snowflake is better suited for traditional analytics, it is not as well suited for ad hoc, big data, operational and customer-facing analytics, the same workloads that helped push existing data warehouses beyond their limits over the past decade.
Firebolt is a 3rd generation data warehouse, built over the last few years, that is by design meant to address several of these more recent analytics challenges and use cases. Firebolt is designed for:
While Firebolt can address reporting and dashboard use cases well, where it shines relative to Snowflake is in the other use cases.
While Snowflake decoupled storage and computing, which in turn simplified scalability and administration, and also helped move traditional data warehouse workloads to the cloud, it has not addressed the newer analytics needs driven by the rise of big data and the need for real-time responsiveness in businesses today.
The promise of the cloud has always been to bring not just lower costs, but also the latest innovations into companies. 3rd generation data warehouses like Firebolt have added innovations that improve performance and cost by an order of magnitude. Now companies can support true ad hoc, high performance, big data, operational and customer-facing analytics and data applications at scale.
Taking advantage of Firebolt does not mean you have to replace Snowflake. Companies that already used Snowflake simply added Firebolt as another cloud data warehouse for these newer use cases where Snowflake is not working for them. Today’s modern data pipelines and data lakes have made adding another cloud warehouse relatively straightforward.