Choosing the right data warehouse and analytics infrastructure on Amazon Web Services (AWS) can be confusing. Over the last decade, the number of options has exploded.
- Hybrid cloud data warehouses that evolved from their on-premises roots, such as Oracle, Teradata, Vertica, or Greenplum, built for BI teams' reporting and dashboards
- 1st generation cloud data warehouses, such as Redshift, built on ParAccel, which helped “Shift Red” (Oracle) and other data warehouse deployments by porting data warehouse technology to the cloud and simplifying deployment
- 2nd generation data warehouses, such as Snowflake, which improved scalability by separating storage and compute, and simplified administration
- 2nd generation query engines such as Presto, or Amazon Athena, which provided federated query engines on top of multiple data sources
- 3rd generation data warehouses, such as Firebolt, which improved performance and lowered costs by adding the latest innovations in data warehousing, more computing choices and control, and different pricing models
Making the right choice partly requires a deeper understanding of how these data warehouses and query engines compare. But it also requires a deeper understanding of what your own analytics needs are today, and what they will be over the next few years. This evolution in cloud data warehousing has been driven by major changes in analytics needs:
Today most companies have the following types of analytics:
- Reporting by analysts against historical data
- Dashboards created by analysts against historical or live data
- Ad hoc analytics within dashboards or other tools for on-the-fly interactive analysis
- High performance analytics of very large or complex queries with massive data sets
- Big data analytics using semi-structured or unstructured data
- Data processing used as part of a data pipeline to deliver data downstream
- Machine or deep learning to train models against data in data lakes or warehouses
- Operational analytics needed by much larger groups of employees to help them make better, faster decisions on their own
- Customer-facing analytics delivered as (paid) self-service analytics to customers
This blog compares the major cloud data warehouse and query engine options on AWS including their overall architectures and core features, scalability, performance, and cost, as well as their suitability across these different analytics use cases.
In the beginning - the challenges of on-premises data warehouses
Before Big Data, analytics centered around on-premises data warehouse leaders, including Oracle, Teradata, Netezza (IBM), Vertica (HP), ParAccel (Actian), and Greenplum (VMware). Management and business intelligence (BI) teams relied on batch loads from various applications to build reports or populate dashboards for decision making.
By 2010, many deployments were starting to hit their limits for several reasons:
- The rise of Big Data: Companies overwhelmed their data warehouses as they tried to analyze massive volumes of both existing and newer types of data.
- The need for ad hoc analytics: More and more, employees and end customers wanted direct access to ad hoc analytics with sub-second performance to help make decisions.
- The need for live data: Employees and customers also increasingly wanted or needed to react to and analyze live data on their own instead of relying on analysts who work mostly with historical data.
Hadoop gained popularity as a solution for analyzing massive amounts of data. But it was batch-based and complicated to manage. It could not support live, continuous ingestion or fast query performance.
In the meantime, companies also started to look to the cloud to help lower costs.
1st Generation Cloud Data Warehouses - Amazon Redshift
Amazon Redshift was the first, and the most successful, data warehouse delivered as a cloud service. It was designed to shift “Red” (Oracle) and other data warehouse deployments to the cloud. Amazon accomplished this by building Redshift on a version of ParAccel, which is itself based on PostgreSQL.
While Redshift has adopted many of the advantages of cloud services, such as simplified management, it has also kept several of the limitations of traditional data warehouses:
- Limited scalability: Redshift still scales like PostgreSQL. You can only scale storage and queries within a single data warehouse. While you can add identically sized read-only copies of a warehouse via replication (concurrency scaling), you can still only have one “master” warehouse for writes, and replication requires a lot of configuration and management. While Redshift introduced new RA3 nodes that separate storage and compute, they do not provide elastic scalability. Redshift Spectrum provides federated querying with pushdown to additional Spectrum nodes and target data sources, but scalability is limited by the size of the Redshift cluster.
- Low query concurrency: Redshift can only queue up to 50 queries across all queues. While that works for internal analyst teams and traditional BI, it does not work for operational or customer-facing analytics, which serve much larger user groups.
- Performance: Redshift takes seconds to minutes the first time a query is compiled and run, which hurts ad hoc queries (see the Fivetran benchmark). Repeatable query workloads such as reporting and dashboards can be fast on average because Redshift uses a result cache to save the results of repeated queries.
- Continuous (streaming) data ingestion: Redshift is still optimized for batch rather than continuous ingestion because of features such as table-level locking for writes, inherited from the original ParAccel and PostgreSQL architecture.
- Price: Redshift charges a markup on all the compute and storage it deploys, which can get expensive very quickly as you add more nodes to increase performance.
2nd Generation Cloud Data Warehouses - Snowflake
Snowflake was one of the first cloud data warehouse vendors to separate storage and compute, and add elastic scalability. The core computing unit is a virtual data warehouse (or warehouse). Once you provision a warehouse of 1, 2, 4, 8, 16, 32, 64, or 128 nodes, and assign users to it, the warehouse pulls data from storage as needed by the queries and stores it on local storage, which it refers to as a cache. To scale queries, you can manually provision a new, larger warehouse in seconds. To scale users, a multi-clustering option, only available with the Enterprise edition and higher, can automatically provision and deprovision up to 10 identical warehouses and assign users across them.
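The sizing model above can be sketched in a few lines. This is a simplified illustration, not Snowflake's billing logic: the $2/credit rate is an assumed Enterprise list price, and the point is simply that each size step doubles both the node count and the hourly cost.

```python
# Simplified sketch of Snowflake-style warehouse sizing (an assumption
# based on the description above, not an official pricing calculator).
SIZES = ["XS", "S", "M", "L", "XL", "2XL", "3XL", "4XL"]

def warehouse_nodes(size: str) -> int:
    """Node count doubles with each size step: XS=1, S=2, ..., 4XL=128."""
    return 2 ** SIZES.index(size)

def credits_per_hour(size: str) -> int:
    """Credits billed per hour track the node count one-for-one here."""
    return warehouse_nodes(size)

def hourly_cost(size: str, price_per_credit: float = 2.0) -> float:
    """Estimated hourly cost at an assumed $2/credit list price."""
    return credits_per_hour(size) * price_per_credit

print(warehouse_nodes("4XL"))  # 128
print(hourly_cost("L"))        # 16.0 -> double the M warehouse's 8.0
```

The consequence described in the bullets below follows directly: if one node is too small for a complex join, the only remedy is the next size up, which doubles the bill.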
Compared to Redshift, Snowflake does simplify both scalability and administration. But many of the same challenges remain:
- Expensive query scalability: Queries are scaled within a single warehouse by assigning tasks across nodes. But certain tasks, such as processing semi-structured data or complex join operations, can easily overwhelm a node. The only way to get a larger node is to choose a larger warehouse, which doubles the number of nodes, and the cost, with each step up.
- Expensive user concurrency: While multi-clustering only keeps additional warehouses running while they are needed, it is expensive because you have to pay for an entire additional cluster to support as little as one incremental user.
- Performance: Snowflake takes seconds to minutes in various benchmarks (see the Fivetran benchmark) the first time a query is run, such as for ad hoc queries. Repeatable query workloads such as reporting and dashboards can be fast on average because Snowflake caches data to local storage the first time it is accessed and uses a result cache to save all results.
- Continuous (streaming) data ingestion: Snowflake is designed more for batch than for continuous ingestion. Snowflake implements table-level locks and has a limit of 20 queued DML writes per table. For batch writes, Snowflake recommends minimum batch intervals of 1 minute.
- Pricing based on AWS compute and storage: Snowflake charges based on the AWS computing and storage it deploys. This creates a conflict of interest between Snowflake and its customers because making Snowflake nodes and warehouses faster or more efficient would mean less money for Snowflake.
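The one-minute-minimum ingestion guidance above implies a micro-batch loading pattern rather than true streaming. A minimal, hypothetical sketch of such a loader; `write_batch` is a stand-in for whatever COPY or INSERT call a real pipeline would use:

```python
# Micro-batch loader sketch: buffer incoming records and flush at most
# once per interval, matching a one-minute-minimum batch guideline.
# `write_batch` is a hypothetical callback, not a real Snowflake API.
import time

def run_loader(source, write_batch, interval_s: float = 60.0):
    buffer = []
    last_flush = time.monotonic()
    for record in source:
        buffer.append(record)
        now = time.monotonic()
        if now - last_flush >= interval_s:
            write_batch(buffer)   # one DML write per interval, per table
            buffer.clear()
            last_flush = now
    if buffer:
        write_batch(buffer)       # flush whatever is left at the end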
2nd Generation Query Engines - Presto and Amazon Athena
Presto was originally developed by Facebook to query data in Hadoop. Facebook needed a query engine that was faster than Hive and could operate across Hadoop and other data. They ended up creating one of the first high performance federated query engines that separated storage and compute, and was able to query existing data “as is”. Since Facebook contributed it to open source, Presto has become one of the most popular open source SQL query engines on top of data lakes or simple storage.
Amazon Athena takes Presto as its foundation and turns it into a serverless SQL query cloud service. It is a multi-tenant service of pooled resources that handles all requests across all customers. The data is stored in S3. Presto (and Athena) acts as a federated query engine where queries are broken down into tasks executed in a cluster or pushed down to target databases. Athena is also the foundation for Redshift Spectrum.
One of Presto’s key strengths, the ability to federate queries across different data sources, is also one of its weaknesses when it comes to improving performance or simplifying administration. You need to implement your own data pipeline independently of Presto, and then configure Presto to work against the data. While Presto can take advantage of optimized storage formats such as Parquet, that support does not come out of the box. You can also push queries down to execute locally within databases and data warehouses such as Redshift. But Presto is not optimized for performance with storage, and it does not implement indexing, sorting, materialized views, caching, or other performance-related features. Athena inherits these limitations and adds several of its own:
- Throttled user concurrency: By default, Athena does not allow more than 20 concurrent active queries per AWS region.
- Performance: Athena can be brought in line with Redshift, Snowflake, or Google BigQuery performance, which is still seconds-to-minutes, not sub-second, at larger scale. Like BigQuery with on-demand resources, performance is also less predictable because Athena uses shared computing resources.
- Continuous (streaming) data ingestion: As a query engine, Athena is not involved in data preparation. But the best performance relies on compressed columnar formats that do not support continuous writes, so ingestion needs to happen in (micro-)batches.
- Charges a markup on data scanned: Athena charges $5 per TB scanned by queries, which can make querying large data sets expensive over time.
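Scan-based pricing compounds quickly with query volume. A back-of-the-envelope sketch, assuming (pessimistically) that every query scans the full table with no partitioning or columnar pruning:

```python
# Rough Athena cost model at the published $5 per TB scanned.
# The workload numbers below are illustrative assumptions.
ATHENA_PRICE_PER_TB = 5.00

def athena_query_cost(tb_scanned: float) -> float:
    """Cost of a single query that scans `tb_scanned` terabytes."""
    return tb_scanned * ATHENA_PRICE_PER_TB

def monthly_cost(tb_scanned_per_query: float, queries_per_day: int) -> float:
    """Assumes a 30-day month and identical scan size per query."""
    return athena_query_cost(tb_scanned_per_query) * queries_per_day * 30

# A dashboard firing 200 full scans a day over a 1 TB table:
print(monthly_cost(1.0, 200))  # 30000.0 -> $30,000/month
```

In practice Parquet/ORC formats and partitioning cut the bytes scanned dramatically, which is exactly why the ingestion bullet above matters.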
3rd Generation Data Warehouses - Firebolt
For most companies, the main reasons for adopting cloud technologies have always been to lower costs and adopt the latest innovations. While 1st generation data warehouses focused on delivering an easier-to-deploy cloud service, and the 2nd generation focused on elastic scalability and ease of administration, 3rd generation data warehouses focus on using some of the latest innovations in analytics and cloud from the last few years to deliver much faster query performance and a much lower total cost of ownership (TCO) at any scale.
The major limitations to performance lie in the separation of storage and compute, and in query engine design. The first time data is needed, it has to be fetched from remote storage over the network, and in most distributed systems with large data sets the network is the first bottleneck. A 10GigE network can transport roughly 1 GB per second fully loaded, if you are lucky, so a terabyte will take over 16 minutes. Even with a network that is 10-100x faster, queries need to be optimized to pull as little data as possible from cold storage. Most cloud data warehouses today pull much larger sets than they need. Snowflake, for example, prunes micro-partitions by keeping track of the min-max data ranges in each one, but each micro-partition is 50-150MB.
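As a quick sanity check of that arithmetic, assuming ~1 GB/s of usable throughput on a fully loaded 10GigE link:

```python
# Time to move 1 TB over a 10GigE link, assuming ~1 GB/s of usable
# throughput (protocol overhead already discounted, per the text).
LINK_GB_PER_SEC = 1.0
TERABYTE_GB = 1000

seconds = TERABYTE_GB / LINK_GB_PER_SEC
minutes = seconds / 60
print(round(minutes, 1))  # 16.7 -> just over 16 minutes per terabyte
```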
The second major performance bottleneck is the query processing itself. For one, most cloud data warehouses do not take advantage of some traditional performance techniques, such as indexing. Several data warehouses such as Athena and Snowflake also do not expose a lot of tuning options. But there are also some more recent optimization techniques that have been refined in the last few years, from cost-based optimization to just-in-time (JIT) compilation for distributed queries.
The major bottleneck in cost is pricing, for two reasons. First, every incremental query costs money, and big, complex queries cost a lot of money. In the Fivetran benchmark, which used 1TB of data, most of the clusters cost $16 per hour. That was the Enterprise ($2 per credit) pricing for Snowflake; running Business Critical or Virtual Private Snowflake (VPS) would cost $4 or more per credit. Running it full-time with 1TB of data would be roughly $300,000 per year at list price. This enormous cost forces many IT organizations to throttle Snowflake usage, often down to individual queries, and to prune data. But pruning data and restricting usage goes against the very goal of democratizing data and analytics.
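The annual figure can be reproduced with simple arithmetic. The credit rates are the list prices cited above; real bills depend on auto-suspend and actual usage, so treat this as an upper-bound sketch:

```python
# Rough annual cost of running one benchmark-sized cluster 24/7.
# $16/hour at $2/credit implies 8 credits/hour (an assumption from
# the benchmark figures cited in the text).
CREDITS_PER_HOUR = 8
HOURS_PER_YEAR = 24 * 365

def annual_cost(price_per_credit: float) -> float:
    return CREDITS_PER_HOUR * price_per_credit * HOURS_PER_YEAR

print(annual_cost(2.0))  # 140160.0 -> ~$140k/yr at Enterprise rates
print(annual_cost(4.0))  # 280320.0 -> ~$280k/yr, i.e. roughly $300k
                         # at Business Critical/VPS rates
```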
The second issue with pricing is the actual pricing structure. Redshift and Snowflake, for example, charge for computing where they build in a markup on their cost for AWS infrastructure. This creates a conflict of interest with customers because these vendors make money by selling more nodes. The main way to improve performance is to buy larger clusters. If Redshift or Snowflake improve performance and efficiency on each node, their customers would buy less, and they would lose money.
Firebolt is the first 3rd generation cloud data warehouse that has added a number of innovations and changed pricing to improve performance and lower TCO. Some of these innovations include:
- Optimized Firebolt File Format (F3) storage that minimizes data access over the network and combines the best of columnar storage with support for continuous updates and native semi-structured data storage to deliver high performance access
- Indexing on top of F3 to achieve sub-second performance. This includes sparse indexing with sorting and clustering to help prune which partitions and nodes to include in a query; aggregating indexes for fast group-by operations; and join indexes to speed up joins without materialized views
- A new, next generation query engine that is designed for multi-workload performance. Features include vectorized processing, JIT compilation, cost-based optimization, indexing, and a host of tuning options to improve query performance and prioritization
- Native semi-structured data support that allows any combination of flattening and nested storage with native Lambda-style functions in SQL for high performance analytics
- Exposed tuning options that have historically been hidden to make administration easier, including choice of instance types (CPU, RAM, SSD), data formats and partitioning, query prioritization, indexes, sorting, and clustering
- Choice of engine and node types to optimize performance for ETL or querying, and for different types of workloads. Unlike Snowflake, you can explicitly provision an engine as a small number of very large AWS instance types
- Pricing in which Firebolt charges a subscription based only on the amount of data you store. There is no markup on compute; customers pay only the AWS instance price for each running node. In addition, Firebolt supports spot instances, which can cost 20%-30% of on-demand instance prices.
This combination of features has delivered up to 182x faster performance in benchmarks and demonstrated sub-second performance at petabyte scale. Firebolt has also shown up to 10x lower cost and a 30x price-performance advantage over Snowflake. One reason is that Firebolt is much more efficient on the same hardware. The second is that Firebolt does not charge any markup on AWS resources and supports spot instances as well.