Table of contents
2. The evolution of data warehousing - from on-premises to the cloud
- In the beginning - the challenges of on-premises data warehouses
- 1st generation cloud data warehouses - Amazon Redshift
- 2nd generation cloud data warehouses - Snowflake, BigQuery
- 2nd generation query engines - Presto and Amazon Athena
- 3rd generation data warehouses - Firebolt
3. Comparison table - Redshift, Athena, BigQuery, Snowflake, Firebolt
4. About Firebolt
Choosing the right cloud data warehouse and analytics infrastructure can be confusing. Over the last decade, the number of options has exploded:
- Hybrid cloud data warehouses that evolved from their on-premises roots, such as Oracle, Teradata, Vertica, or Greenplum, built for BI teams doing reporting and dashboards
- 1st generation cloud data warehouses, such as Redshift, which was built on ParAccel and helped “shift Red” (Oracle) and other data warehouse deployments to the cloud by porting data warehouse technology and simplifying deployment
- 2nd generation data warehouses, such as Snowflake and BigQuery, which improved scalability by separating storage and compute, and simplified administration
- 2nd generation query engines, such as Presto and Amazon Athena, which provide federated querying on top of multiple data sources
- 3rd generation data warehouses, such as Firebolt, which improved performance and lowered costs by adding the latest innovations in data warehousing, more computing choices and control, as well as different pricing models
Making the right choice partly requires a deeper understanding of how these data warehouses and query engines compare. But it also requires understanding what your own analytics needs are today, and what they will be over the next few years. This evolution in cloud data warehousing has been driven by major changes in analytics needs. Today most companies have the following types of analytics:
- Reporting by analysts against historical data
- Dashboards created by analysts against historical or live data
- Ad hoc analytics within dashboards or other tools for on-the-fly interactive analysis
- High performance analytics of very large or complex queries with massive data sets
- Big data analytics using semi-structured or unstructured data
- Data processing used as part of a data pipeline to deliver data downstream
- Machine or deep learning to train models against data in data lakes or warehouses
- Operational analytics needed by much larger groups of employees to help them make better, faster decisions on their own
- Customer-facing analytics delivered as (paid) self-service analytics to customers
This whitepaper compares the major cloud data warehouse and query engine options across more than 30 different categories, including their overall architectures and core features, scalability and performance, and their suitability across these different analytics use cases.
The evolution of data warehousing - from on-premises to the cloud
In the beginning - the challenges of on-premises data warehouses
Before big data, analytics centered on the on-premises data warehouse leaders, including Oracle, Teradata, Netezza (IBM), Vertica (HP), ParAccel (Actian), and Greenplum (VMware). Management and business intelligence (BI) teams relied on batch loads from various applications to build reports or populate dashboards for decision making. By 2010, many deployments were starting to hit their limits for several reasons:
- The rise of big data: Companies overwhelmed their data warehouses as they tried to analyze massive volumes of both existing and newer types of data.
- The need for ad hoc analytics: More and more, employees and end customers wanted direct access to ad hoc analytics with sub-second performance to help make decisions.
- The need for live data: Employees and customers also increasingly wanted or needed to react to and analyze live data on their own instead of relying on analysts who work mostly with historical data.
Hadoop gained popularity as a solution for storing and analyzing massive amounts of data. But it was batch-based and complicated to manage, and it could not support live, continuous ingestion, fast query performance, or ad hoc analytics. In the meantime, companies also started to look to the cloud to help lower costs.
1st generation cloud data warehouses - Amazon Redshift
Amazon Redshift was one of the first, and originally the most successful, cloud data warehouse services. It was designed to shift “Red” (Oracle) and other data warehouse deployments to the cloud. Amazon accomplished this by building Redshift on a version of ParAccel, which is based on PostgreSQL.
While Redshift has adopted many of the advantages of cloud services, such as simplified management, it has also kept several of the limitations of traditional data warehouses:
- Limited scalability: Redshift still scales like PostgreSQL. You can only scale storage and queries within a single data warehouse. While you can have up to 10 identically-sized read-only copies of a warehouse via replication (concurrency scaling), you can still only have one “master” warehouse for writes, and concurrency scaling requires significant configuration and management. Redshift introduced RA3 nodes that separate storage and compute, but they do not provide elastic scalability. In addition, Redshift Spectrum provides federated querying with pushdown to additional Spectrum nodes and target data sources, but its scalability is limited by the size of the Redshift cluster.
- Low query concurrency: Even with concurrency scaling, Redshift only supports up to 50 concurrent queries by default. While that works for internal analyst teams and traditional BI, it does not work for operational or customer-facing analytics, which have much larger user groups.
- Performance: Redshift takes seconds to minutes the first time a query is compiled and run, such as for ad hoc queries (see the Fivetran benchmark). Repeatable query workloads such as reporting and dashboards can be fast on average because Redshift uses a result cache to save the results of repeated queries (see the sketch after this list).
- Continuous (streaming) data ingestion: Redshift is still optimized for batch, not continuous, ingestion because of features such as table-level locking for writes, which it inherited from the original ParAccel and PostgreSQL architecture.
- Price: Redshift charges a markup on all compute and storage. Costs increase the most as you add nodes to improve performance. Redshift is also more challenging to scale because storage and compute are not fully decoupled, so companies often start with larger clusters to support future growth, which further increases costs.
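Redshift’s result cache behavior is easy to see in practice. Below is a minimal sketch, using a hypothetical table, of how the first (compiled) run differs from cached repeat runs; enable_result_cache_for_session is the session parameter Redshift exposes for turning the cache off when measuring first-run performance.

```sql
-- First run: Redshift compiles and executes the query (seconds to minutes).
-- Identical repeat runs can be served from the result cache almost instantly.
SELECT date_trunc('day', event_time) AS day,
       count(*) AS events
FROM web_events          -- hypothetical table
GROUP BY 1
ORDER BY 1;

-- Turn the result cache off for this session to measure true first-run
-- (e.g., ad hoc) query performance.
SET enable_result_cache_for_session TO off;
```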
2nd generation cloud data warehouses - Snowflake, BigQuery
Snowflake and Google BigQuery were two of the first cloud data warehouses to separate storage and compute, and add elastic scalability.
Snowflake was the first modern cloud data warehouse with a decoupled storage and compute architecture. It provides dedicated compute resources that run within Snowflake’s cloud environment. The core compute unit is a virtual data warehouse (or warehouse). Once you provision a warehouse of 1, 2, 4, 8, 16, 32, 64, or 128 nodes (servers) and assign users to it, the warehouse pulls data from storage as needed by the queries and keeps it on local storage, which Snowflake refers to as a cache. To scale queries, you can manually provision a new, larger warehouse in seconds. To scale users, a multi-clustering option, only available with the Enterprise edition and higher, can automatically provision and deprovision up to 10 identical warehouses and assign users across them (see the sketch below).
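As a sketch of what this looks like in practice (the warehouse name and settings are illustrative; the DDL itself is standard Snowflake), provisioning and multi-clustering are driven by a few warehouse parameters:

```sql
-- Each size maps to a fixed node count per cluster (LARGE = 8 nodes);
-- each size step up doubles the nodes and the credits per hour.
CREATE WAREHOUSE reporting_wh
  WITH WAREHOUSE_SIZE    = 'LARGE'
       MIN_CLUSTER_COUNT = 1
       MAX_CLUSTER_COUNT = 10    -- multi-clustering (Enterprise edition and up)
       AUTO_SUSPEND      = 60    -- suspend after 60 seconds idle
       AUTO_RESUME       = TRUE; -- resume automatically on the next query
```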
When compared to Redshift, Snowflake does simplify both scalability and administration. But many of the same challenges remain:
- Expensive query scalability: Queries are scaled within a single warehouse by assigning tasks across nodes. But certain tasks, such as queries over semi-structured data or complex joins, can easily overwhelm a node by requiring more RAM. The only way to increase node size is to choose a larger warehouse, which doubles the number of nodes, and the cost, with each step up (see the sketch after this list).
- Expensive user concurrency: While multi-clustering only keeps additional warehouses running while they are needed, it is expensive because you pay for an entire additional cluster to support as little as one incremental user.
- Performance: Snowflake takes seconds to minutes the first time a query is run, such as for ad hoc queries, in various benchmarks (see the Fivetran benchmark). Repeatable query workloads such as reporting and dashboards can be fast on average because Snowflake caches data to local storage the first time it is accessed and uses a result cache to save all results.
- Continuous (streaming) data ingestion: Snowflake is designed more for batch than for continuous ingestion. It implements table-level locks and has a limit of 20 queued DML writes per table. For batch writes, Snowflake recommends minimum batch intervals of 1 minute.
- Pricing for compute: Snowflake charges based on the AWS compute and storage it deploys. While there is essentially no markup on storage, there is one on compute, and it can get costly as you increase warehouse sizes to improve performance. For example, in the Fivetran benchmark, running the Snowflake warehouse 24x7 would cost $150K-$300K annually for the 1TB deployment.
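To make that cost curve concrete, here is a rough list-price sketch (the numbers assume Enterprise pricing at $2 per credit; the warehouse name is the hypothetical one from above). Because the only way to get bigger nodes is a bigger warehouse, each resize doubles the credit burn:

```sql
-- LARGE  =  8 credits/hour ->  8 x 24 x 365 x $2 ~ $140K/year if run 24x7
-- XLARGE = 16 credits/hour -> 16 x 24 x 365 x $2 ~ $280K/year if run 24x7
ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'XLARGE';
```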
Google BigQuery was released before Snowflake as a serverless query engine with decoupled storage and compute. So why not call it the first modern cloud data warehouse? The short answer is that, like other great technologies out of Google, it is a unique piece of engineering rather than a typical data warehouse. It started as a serverless on-demand query engine and became a data warehouse over time. That is both one of its biggest strengths and one of its biggest weaknesses.
BigQuery started as a public implementation of several internal Google technologies, which you can read about in the official BigQuery whitepaper. The distributed query engine is based on Dremel. Data is stored on Colossus, Google’s distributed file system, in a columnar format called Capacitor that allows BigQuery to operate directly on compressed data. Dremel distributes queries across an execution tree of leaves (slots) and a hierarchy of mixers that aggregate results, similar to MapReduce. For each query, Borg, Google’s cluster management system, dynamically provisions the server resources (slots). Compute relies on Jupiter, a petabit (yes, 1 million gigabit) network, for accessing all data. As each slot completes a task, it moves the results into shared memory for the next task (e.g., a mixer) to pick up. BigQuery will shuffle to repartition data for the next steps, similar to MapReduce. If data outgrows memory or becomes too expensive to hold there, it can shuffle to disk (Colossus).
Today you can run BigQuery on demand, with reserved slots that run 24x7, or with flex slots that you can run for as little as seconds. On-demand, which costs $5/TB of uncompressed data scanned, is great for occasional analytics or for queries that do not scan much data. But on-demand can quickly cost a fortune if you run queries regularly, which is why most data warehouse customers now buy reserved slots at $2,000 a month per 100 slots, or $1,700 a month with an annual commitment. More recently, Google also released flex slots, which are priced at $4/hour for 100 slots (a 44% premium over monthly reserved). The combination of reserved and flex slots makes BigQuery price-competitive with Snowflake.
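A quick sketch of why on-demand gets expensive (the table name is hypothetical): on-demand bills $5 per TB of uncompressed data scanned by each query, so regular workloads rack up costs fast, and the main lever you have is reducing the bytes each query scans.

```sql
-- Scanning a 2 TB table 50 times a day is ~100 TB/day, or ~$500/day
-- (~$15,000/month) on demand, versus $2,000/month for 100 reserved slots.
-- Selecting only needed columns and filtering on a partition column
-- reduces the bytes scanned, and therefore the bill.
SELECT user_id, COUNT(*) AS events
FROM `my_project.analytics.events`   -- hypothetical partitioned table
WHERE event_date = '2021-01-01'      -- partition filter limits the scan
GROUP BY user_id;
```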
Unlike Snowflake, BigQuery supports streaming ingestion at scale. It accomplishes this with an architecture similar to Vertica’s, which is also used by a few other specialized engines: it ingests one message (row) at a time into row-based storage in memory and eventually commits the data to columnar storage.
BigQuery, like Snowflake, supports nearly any scale, and as a serverless query engine it has even less administrative overhead. Despite its impressive engineering as a dynamically provisioned query engine, BigQuery still has several challenges as a data warehouse:
- Default limits with BigQuery on-demand pricing: While there are many official limits for BigQuery, they are mostly limits for on-demand pricing, because you can buy reserved slots and contact Google to get around just about every limit. This comparison highlights those limits (marked with an asterisk *), which include 100 concurrent users, 1,500 updates per table per day (an average of roughly 1 update per table per minute), 4 concurrent external queries, and 100,000 messages per second per table. But Google does offer to raise the limits, and there are clear examples of deployments going beyond all of them with reserved resources.
- On-demand pricing for data warehouses: You might be able to pay less using on-demand pricing for your data warehouse, but it is very rare. For most companies with regular analytics loads, on-demand leads to very high bills. It also brings all the limits Google imposes to protect shared resources. In short, if you need a data warehouse, you should probably buy annual reserved slots. This should give you competitive costs compared to Snowflake or Redshift, as seen in the Fivetran benchmark.
- Shared memory and shuffling: The combination of Dremel, Borg, shared memory, and shuffling is perfect for optimizing shared resource utilization under on-demand pricing. But all that data movement over the network and in and out of shared memory hurts performance. Even with reserved slots and workload management, workloads are not truly isolated, and you have no control over memory. Performance can get even worse if you run out of shared memory and spill to storage. This architecture is part of the reason BigQuery ends up with roughly the same performance as Snowflake or Redshift, which have a much slower network but cache data with compute.
2nd generation query engines - Presto and Amazon Athena
Presto was originally developed by Facebook to query data in Hadoop. Facebook needed a query engine that was faster than Hive and could operate across Hadoop and other data. It ended up creating one of the first high performance federated query engines that separated storage and compute and could query existing data “as is”. Since Facebook contributed it to open source, Presto has become one of the most popular open source SQL query engines on top of data lakes or simple storage.
Amazon Athena takes Presto as its foundation and turns it into a serverless SQL query cloud service. It is a multi-tenant service of pooled resources that handles requests across all customers, with the data stored in S3. Presto (and Athena) acts as a federated query engine: queries are broken down into tasks executed in a cluster or pushed down to target databases. Redshift Spectrum, while implemented differently from Athena, gives Redshift customers integrated federated querying.
One of Presto’s key strengths, the ability to federate queries across different data sources, is also one of its weaknesses when it comes to improving performance or simplifying administration. You need to implement your own data pipeline independently of Presto, and then configure Presto to work against the data. While Presto can take advantage of optimized storage formats such as Parquet, it does not do so out of the box (see the sketch below). You can also push down queries to execute locally within databases and data warehouses such as Redshift. But Presto is not optimized for performance with storage, and it does not implement indexing, sorting, materialized views, caching, or other performance-related features.
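As a sketch of the setup work Athena leaves to you (the bucket and table names are hypothetical; the DDL is standard Athena), you define an external table over files you already manage in S3, ideally compressed Parquet, and partition it so queries scan, and pay for, less data:

```sql
CREATE EXTERNAL TABLE web_events (
  user_id    string,
  event_type string,
  event_time timestamp
)
PARTITIONED BY (event_date string)   -- partition pruning cuts data scanned
STORED AS PARQUET                    -- columnar format you must produce yourself
LOCATION 's3://my-bucket/web_events/';

-- Register newly added partitions before querying them.
MSCK REPAIR TABLE web_events;
```

On top of this, Athena adds several limitations of its own: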
- Throttled user concurrency: Athena by default does not allow more than 20 concurrent active queries per AWS region.
- Performance: Athena can be brought in line with Redshift, Snowflake, or Google BigQuery performance, which is still seconds-to-minutes, not sub-second, at larger scale. Like BigQuery with on-demand resources, performance is also less predictable because Athena uses shared compute resources.
- Continuous (streaming) data ingestion: As a query engine, Athena is not involved in data preparation. But the best performance relies on a compressed columnar format that does not support continuous writes, so ingestion needs to be done in (micro-)batches.
- Charges a markup on data scanned: Athena charges $5 per TB scanned for queries, which can make querying large data sets expensive over time.
3rd generation data warehouses - Firebolt
For most companies, the main reasons for adopting cloud technologies have always been to lower costs and adopt the latest innovations. While 1st generation cloud data warehouses focused on delivering an easier-to-deploy cloud service, and the 2nd generation focused on elastic scalability and ease of administration, 3rd generation data warehouses focus on using the latest innovations in analytics and cloud to deliver much faster query performance and a much lower total cost of ownership (TCO) at any scale.
The major limitations on performance lie in the separation of storage and compute, and in query engine design. The first time data is needed, it has to be fetched from remote storage over the network, and in most distributed systems with large data sets, the network is the first bottleneck. A 10GigE network can transport roughly 1 GB per second fully loaded, if you are lucky; at that rate, a terabyte takes over 16 minutes. Even with a network that is 10-100x faster, queries need to be optimized to pull as little data as possible from cold storage. Most cloud data warehouses today pull much larger sets than needed. Snowflake, for example, prunes out micro-partitions by keeping track of the min-max data ranges in each micro-partition, but each micro-partition is 50-150MB.
The second major performance bottleneck is the query processing itself. For one, most cloud data warehouses do not take advantage of some traditional performance techniques, such as indexing. Several, including Athena and Snowflake, also expose few tuning options. But there are also more recent optimization techniques that have been refined in the last few years, from cost-based optimization to just-in-time (JIT) compilation for distributed queries.
The major cost bottleneck is pricing. Every incremental query costs money, and big, complex queries cost a lot of money. In the Fivetran benchmark, which used roughly 1TB of data, most of the clusters cost about $16 per hour; that reflects Snowflake’s Enterprise pricing ($2 per credit). Running Business Critical or Virtual Private Snowflake (VPS) would be $4 or more per credit, and running it full-time at that rate would be roughly $300,000 per year at list price ($32 per hour x 8,760 hours is about $280,000). This enormous cost forces many IT organizations to throttle Snowflake usage, often down to individual queries, and to prune data. But pruning data and restricting usage goes against the very goal of democratizing data and analytics.
Firebolt is the first 3rd generation cloud data warehouse that has added a number of innovations to improve performance and lower TCO. Some of these innovations include:
- Optimized Firebolt File Format (F3) storage that minimizes data access over the network and combines the best of columnar storage with support for continuous writes and native semi-structured data storage to deliver high performance access
- Indexing on top of F3 to achieve sub-second performance. This includes sparse indexes, with sorting and clustering, that help prune which partitions and nodes to include in a query; aggregating indexes for fast group-by and other aggregation operations, which replace materialized views; and join indexes to speed up joins (see the sketch after this list)
- A new, next generation query engine that is designed for multi-workload performance. Features include vectorized processing, JIT compilation, cost-based optimization, indexing, and a host of tuning options to improve query performance and prioritization
- Native semi-structured data support that allows any combination of flattening and nested storage with native Lambda functions in SQL for high performance analytics
- Choice of engine and node types to optimize performance for ETL or querying, and for different types of workloads. Unlike Snowflake, you can explicitly provision an engine as a small number of very large nodes. Firebolt allows you to choose any node type, and any number in each engine, to meet your exact needs
- 10x greater efficiency: Firebolt has demonstrated 10x or greater price-performance in published customer benchmarks through greater engine efficiency and the choice of the best resources for each engine
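As a sketch of how some of these features look in Firebolt SQL (the table and column names are hypothetical; the DDL follows Firebolt’s documented syntax), a primary (sparse) index sorts and clusters the table so queries can prune data ranges, and an aggregating index precomputes group-by results in place of a materialized view:

```sql
CREATE FACT TABLE web_events (
  user_id    BIGINT,
  event_type TEXT,
  event_time TIMESTAMP,
  revenue    DOUBLE,
  scores     ARRAY(INT)
) PRIMARY INDEX event_time, user_id;   -- sparse index: sort, cluster, prune

-- Precomputed aggregations served in place of a materialized view.
CREATE AGGREGATING INDEX events_by_type ON web_events (
  event_type,
  COUNT(*),
  SUM(revenue)
);

-- Native semi-structured support: Lambda-style array functions in SQL.
SELECT user_id,
       ARRAY_COUNT(x -> x > 10, scores) AS high_scores
FROM web_events;
```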
The combination of these features has delivered 4x to 6,000x faster performance across queries in customer benchmarks, including sub-second performance. Note that while price-performance in these customer benchmarks is 10x or greater, the total cost savings are smaller, at most 10x lower, because most customers take much of the gain as 4-6,000x faster queries rather than as lower spend compared to the previous solution.
Comparison: Redshift vs Athena vs BigQuery vs Snowflake vs Firebolt
What follows is a detailed side-by-side comparison of the main options across the 1st, 2nd, and 3rd generation data warehouses and query engines: Amazon Redshift, Amazon Athena, Google BigQuery, Snowflake, and Firebolt. The comparison spans more than 30 categories, including:
- Architecture: elasticity, technology support, isolation, and security
- Scalability: elastic scale, scale efficiency, query and user concurrency
- Performance: query, indexing, data type and storage optimization
- Use cases: reporting, dashboards, ad hoc, operational and customer-facing analytics
The detailed comparison of Redshift, Athena, Snowflake, BigQuery, and Firebolt across architecture, scalability, performance, use cases and cost of ownership highlights the following major differences:
- Redshift, while arguably the most mature and feature-rich, is also the most like a traditional data warehouse in its limitations. This makes it the hardest to manage, and not as well suited to the newer use cases.
- Athena is arguably the easiest, least expensive and best suited for “one-off analytics”. But it is also the most limited, and requires you to manage your own (external) storage and ingestion very well, which is especially hard for continuous ingestion.
- BigQuery combines the best of Athena as a serverless data warehouse with the option to purchase reserved compute like Snowflake. While on-demand pricing is often too expensive for regular workloads, reserved and flex slots deliver pricing that is competitive with Redshift and Snowflake. But BigQuery also comes with the limitations of a shared service, including extensive throttling and other limits designed to protect BigQuery customers from rogue queries; if you buy reserved or flex slots, remember that you can ask to have these limits raised. Its query performance is similar to Redshift and Snowflake, and too slow for interactive analytics at scale.
- Snowflake as a more modern cloud data warehouse with decoupled storage and compute is easier to manage for reporting and dashboards, and delivers strong user scalability. It also runs on multiple clouds. But like the others, Snowflake does not deliver sub-second performance for ad hoc, interactive analytics at any reasonable scale, or support continuous ingestion well. It also can be expensive to scale for large data sets with complex queries or semi-structured data.
- Firebolt is the only data warehouse with decoupled storage and compute that supports ad hoc and semi-structured data analytics with sub-second performance at scale. It also combines simplified administration with choice and control over node types, and greater price-performance that can deliver the lowest overall TCO. This makes it the best choice for ad hoc, high performance, operational, and customer-facing analytics.
In short, different data warehouses are better suited to different use cases. This information will hopefully help you choose the right data warehouse or query engine for each use case. It may also help you prepare your analytics infrastructure so that these choices can be made.
The combination of today’s modern data pipelines and data lakes with the simplicity of cloud services has made adding another cloud data warehouse relatively straightforward and cost effective. Companies that were already using a cloud data warehouse for reporting and dashboards have been able to add Firebolt as another cloud data warehouse in weeks, and use it for ad hoc, semi-structured data, operational, and customer-facing analytics, while leaving the existing analytics in place.
If you put a modern data pipeline in place that makes it easy to redirect your data from a data lake or other “single source of truth,” you will be able to choose the best combination of price and performance for each analytics need.
About Firebolt
Firebolt is the world’s fastest cloud data warehouse, purpose-built for delivering a new grade of analytic experiences over big data. It provides tech companies with a modern data stack, and orders of magnitude faster query performance at a fraction of the cost of the alternatives, by combining the simplicity, elasticity, and low cost of the cloud with the latest innovations in analytics. Engineering teams that adopted Firebolt have been able to deploy data applications and interactive analytics over big data in weeks while delivering sub-second performance at terabyte to petabyte scale.
Firebolt was founded by members of the original team at Sisense, a global BI leader, and is backed by Zeev Ventures, TLV Partners, Bessemer Venture Partners, and Angular Ventures.