Choosing the right cloud data warehouse and analytics infrastructure can be confusing. Over the last decade, the number of options have exploded:
Making the right choice partly requires a deeper understanding into how these data warehouses and query engines compare. But it also requires a deeper understanding of what your own analytics needs are today, and what they will be over the next few years. This evolution in cloud data warehousing has been driven by major changes in analytics needs. Today most companies have the following types of analytics:
This whitepaper compares the major cloud data warehouse and query engine options across more than 30 different categories including their overall architectures and core features, scalability and performance, as well as their suitability across these different analytics use cases.
Before big data, analytics centered around on-premises data warehouse leaders, including Oracle, Teradata, Netezza (IBM), Vertica (HP), ParAccel (Actian) and GreenPlum (VMWare). Management and business intelligence (BI) teams relied on batch loads from various applications to build reports or populate dashboards for decision making. By 2010, many deployments were starting to hit their limits for several reasons:
Hadoop gained popularity as a solution for storing and analyzing massive amounts of data. But it was batch-based and complicated to manage. It could not support live, continuous ingestion or fast query performance, or ad hoc analytics. In the meantime, companies also started to look to the cloud to help lower costs.
Amazon Redshift was one of the first, and originally the most successful data warehouse as a cloud service. It is a cloud service that was designed to shift “Red” (Oracle) and other data warehouse deployments to the cloud. Amazon accomplished this by building Redshift using a version of ParAccel, which is based on PostgreSQL.
While Redshift has definitely adopted many of the advantages of cloud services such as simplified management, it has also kept several of the limitations of traditional data warehouses:
Snowflake and Google BigQuery were two of the first cloud data warehouses to separate storage and compute, and add elastic scalability.
Snowflake was the first modern cloud data warehouse with a decoupled storage and compute architecture. It provides you with dedicated compute resources that run within Snowflake’s cloud environment. The core computing unit is a virtual data warehouse (or warehouse). Once you provision a warehouse of 1, 2, 4, 8, 16, 32, 64, or 128 nodes (or servers), and assign users to it, the warehouse pulls data from storage as needed by the queries and stores it on local storage, which it refers to as a cache. To scale queries, you can manually provision a new, larger warehouse in seconds. To scale users, a multi-clustering option, only available with the Enterprise edition and higher, can automatically provision and deprovision up to 10 identical warehouses and assign users across them.
When compared to Redshift, Snowflake does simplify both scalability and administration. But many of the same challenges still remain:
Google BigQuery was released before Snowflake as a serverless query engine with decoupled storage and compute. So why not call it the first modern cloud data warehouse? The short answer is that like other great technologies out of Google, it is a unique piece of engineering that is not a typical data warehouse. It was a serverless on demand query engine that also became a data warehouse over time. That is both one of its biggest strengths and weaknesses.
BigQuery started as a public implementation of several internal technologies, which you can read about in the official BigQuery whitepaper. The distributed query engine is based on Dremel. BigQuery uses a columnar store file system called Colossus that stores data in Capacitor with a format that allows BigQuery to operate directly on compressed data. Dremel distributes queries to an execution tree of leaves (slots) and a hierarchy of mixers similar to MapReduce that aggregate results. For each query, Borg, which is Google’s cluster management system, dynamically provisions the server resources (slots). The compute relies on Jupiter, a petabit (yes 1 million gigabit) network for accessing all data. As each task completes the slot, it moves the results into shared memory for the next task (e.g. a Mixer) to pick it up. BigQuery will shuffle to repartition data for the next steps similar to MapReduce. If data outgrows memory or becomes too expensive, it can shuffle to disk (Colossus.)
Today you can run BigQuery on demand, with reserved slots that you run 24x7, or with flex slots that you can run for seconds. On demand, which costs $5/TB of uncompressed data scanned, is great for running the occasional analytics or for queries where you’re not scanning a lot. But on demand can quickly cost a fortune if you run queries regularly, which is why most data warehouse customers now buy reserved slots at $2000 a month per 100 or $1700 a month for an annual commitment. More recently, Google also released flex slots, which are priced at $4/hour for 100 slots (a 44% premium over monthly reserved.) The combination of reserved and flex slots makes BigQuery price-competitive with Snowflake.
Unlike Snowflake, BigQuery supports streaming ingestion at scale. They accomplish this with an architecture similar to Vertica that is also used by a few other specialized engines. They ingest one message (row) at a time into row-based memory and then eventually commit to columnar storage.
BigQuery, like Snowflake, supports any scale and has even less administrative overhead as a serverless query engine. Despite the impressive engineering feats as a dynamically provisioned query engine, BigQuery still has several challenges as a data warehouse:
Presto was originally developed by Facebook to query data in Hadoop. Facebook needed a query engine that was faster than Hive and could operate across Hadoop and other data. They ended up creating one of the first high performance federated query engines that separated storage and compute, and was able to query existing data “as is”. Since Facebook contributed it to open source, Presto has become one of the most popular open source SQL query engines on top of data lakes or simple storage.
Amazon Athena takes Presto as its foundation, and turns it into a serverless SQL query cloud service. It is a multi-tenant service of pooled resources that handles all requests across all customers. The data is stored in S3. Presto (and Athena) acts as a federated query engine where queries are broken down into tasks executed in a cluster or pushed down to target databases. Redshift Spectrum, while implemented differently from Athena, gives Redshift customers integrated federated querying.
One of Presto’s key strengths, the ability to federate queries across different data sources, is also one of its weaknesses when it comes to improving performance or simplifying administration. You need to implement your own data pipeline independently of Presto, and then configure Presto to work against the data. While Presto can take advantage of optimized storage formats such as Parquet, it is not out of the box. You can also push down queries to execute locally within databases and data warehouses such as Redshift. But Presto is not optimized for performance with storage. It also does not implement indexing, sorting, materialized views, caching, or other performance-related features.
For most companies, the main reasons for adopting cloud technologies have always been to lower costs and adopt the latest innovations. While 1st generation data warehouses focused on delivering an easier-to-deploy cloud service, and the 2nd generation focused elastic scalability and ease of administration, 3rd generation data warehouses are focused on using some of the latest innovations in analytics and cloud over the last few years to deliver much faster query performance and a much lower total cost of ownership (TCO) at any scale.
The major limitations to performance are in the separation of storage and compute, and query engine design. The first time data is needed, it has to be fetched from remote storage over the network, and in most distributed systems with large data sets, the network is the first bottleneck. A 10GigE network can transport roughly 1 GB per second fully loaded, if you are lucky. A terabyte will take over 16 minutes. Even with a network that is 10-100x faster, queries need to be optimized to pull as little data as possible from cold storage. Most cloud data warehouses today pull much larger sets. Snowflake, for example, will prune out micro-partitions by keeping track of the min-max data ranges in each micro-partition. But each micro-partition is 50-150MB.
The second major performance bottleneck is the query processing itself. For one, most cloud data warehouses do not take advantage of some traditional performance techniques, such as indexing. Several data warehouses such as Athena and Snowflake also do not expose a lot of tuning options. But there are also some more recent optimization techniques that have been refined in the last few years, from cost-based optimization to just-in-time (JIT) compilation for distributed queries.
The major bottleneck in cost is pricing. Every incremental query costs money, and big, complex queries cost a lot of money. If you look at the Fivetran benchmark, which managed 1TB of data most of the clusters cost $16 per hour. That was the enterprise ($2 per credit) pricing for Snowflake. Running business-critical or Virtual Private Snowflake (VPS) would be $4 or more per credit. Running it full-time with 1TB of data would be roughly $300,000 per year at list price. This enormous cost forces many IT organizations to throttle Snowflake usage, often down to individual queries, and prune data. But pruning data and restricting usage goes against the very goal of the democratization of data and analytics.
Firebolt is the first 3rd generation cloud data warehouse that has added a number of innovations to improve performance and lower TCO. Some of these innovations include:
The combination of features has delivered 4-6000x faster performance in customer banchmarks, across queries and sub-second performance. Note in these customer benchmarks that while price-performance is 10x or greater, the total cost savings are less, only up to 10x lower at most. This is because most customers improve performance 4-6000x across queries compared to the previous solution.
What follows is a detailed side-by-side comparison of the main options across the 1st, 2nd, and 3rd generation data warehouses and query engines: Amazon Redshift, Amazon Athena, Snowflake and Firebolt. The comparison spans more than 30 categories, including:
The detailed comparison of Redshift, Athena, Snowflake, BigQuery, and Firebolt across architecture, scalability, performance, use cases and cost of ownership highlights the following major differences:
In short, some data warehouses are better for different use cases. Using this information will hopefully help you choose the right data warehouse or query engine for different use cases. It may also help you prepare your analytics infrastructure so that these choices can be made.
The combination of today’s modern data pipelines and data lakes with the simplicity of cloud services have made adding another cloud warehouse relatively straightforward and cost effective. Those companies that were already using a cloud data warehouse for reporting and dashboards were able to add Firebolt as another cloud data warehouse in weeks, and use it for ad hoc, semi-structured data, operational and customer-facing analytics, while leaving the existing analytics in place.
If you do put a modern data pipeline in place that does make it easier to redirect your data from a data lake or other “single source of the truth” then you will be able to choose the best combination of price and performance for each analytics need.
Firebolt is the world’s fastest cloud data warehouse, purpose-built for delivering a new grade of analytic experiences over big data. It provides tech companies with a modern data stack, and orders of magnitude faster query performance at a fraction of the cost of the alternatives by combining the simplicity, elasticity and low cost of the cloud with the latest innovations in analytics. Engineering teams that adopted Firebolt have been able to deploy data applications and interactive analytics over big data in weeks while delivering sub-second performance at terabyte to petabyte scale.
Firebolt is founded by members of the original team at Sisense, A global BI leader, and backed by Zeev Ventures, TLV Partners, Bessemer Venture Partners and Angular Ventures.