WHITEPAPER

A Comparison of Data Warehouse and Query Engines on AWS

Table of contents

1. Introduction
2. The evolution of data warehousing - from on premises to the cloud
3. Comparison table - Redshift, Athena, Snowflake, Firebolt
4. Summary
5. About Firebolt

Introduction

Choosing the right data warehouse and analytics infrastructure on Amazon Web Services (AWS) can be confusing. Over the last decade, the number of options has exploded, from Amazon Redshift and Athena to Snowflake and Firebolt.

Making the right choice partly requires a deeper understanding of how these data warehouses and query engines compare. But it also requires a deeper understanding of what your own analytics needs are today, and what they will be over the next few years. This evolution in cloud data warehousing has been driven by major changes in analytics needs. Today most companies have several types of analytics: reporting, dashboards, ad hoc analysis, operational or customer-facing analytics, data processing, and data science or machine learning (ML).

This whitepaper compares the major cloud data warehouse and query engine options on AWS across more than 30 different categories including their overall architectures and core features, scalability, performance, and cost, as well as their suitability across these different analytics use cases.

The evolution of data warehousing - from on premises to the cloud

In the beginning - the challenges of on-premises data warehouses

Before big data, analytics centered on on-premises data warehouse leaders, including Oracle, Teradata, Netezza (IBM), Vertica (HP), ParAccel (Actian), and Greenplum (VMware). Management and business intelligence (BI) teams relied on batch loads from various applications to build reports or populate dashboards for decision making. By 2010, many deployments were starting to hit their limits.

Hadoop gained popularity as a solution for analyzing massive amounts of data, but it was batch-based and complicated to manage. It could not support live, continuous ingestion or fast query performance. In the meantime, companies also started to look to the cloud to help lower costs.

1st generation cloud data warehouses - Amazon Redshift

Amazon Redshift was the first, and has been the most successful, data warehouse delivered as a cloud service. Its very name signaled the intent to shift “Red” (Oracle) and other data warehouse deployments to the cloud. Amazon accomplished this by building Redshift on a version of ParAccel, which is in turn based on PostgreSQL.

While Redshift has definitely adopted many of the advantages of cloud services, such as simplified management, it has also kept several of the limitations of traditional data warehouses, including coupled compute and storage scaling and batch-centric ingestion.

2nd generation cloud data warehouses - Snowflake

Snowflake was one of the first cloud data warehouse vendors to separate storage and compute, and add elastic scalability. The core computing unit is a virtual data warehouse (or warehouse). Once you provision a warehouse of 1, 2, 4, 8, 16, 32, 64, or 128 nodes, and assign users to it, the warehouse pulls data from storage as needed by the queries and stores it on local storage, which it refers to as a cache. To scale queries, you can manually provision a new, larger warehouse in seconds. To scale users, a multi-clustering option, only available with the Enterprise edition and higher, can automatically provision and deprovision up to 10 identical warehouses and assign users across them.
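As a rough illustration, provisioning and scaling in Snowflake are a few SQL statements. This is a minimal sketch; the warehouse name and settings are illustrative, and multi-cluster scaling requires the Enterprise edition or higher.

    -- Provision a 16-node (X-Large) warehouse; compute is separate from storage.
    CREATE WAREHOUSE reporting_wh
      WAREHOUSE_SIZE = 'XLARGE'   -- 16 nodes
      AUTO_SUSPEND = 300          -- suspend after 5 idle minutes
      AUTO_RESUME = TRUE;

    -- Scale queries: resize to a larger warehouse (32 nodes).
    ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'XXLARGE';

    -- Scale users: let Snowflake add up to 10 identical warehouses as
    -- concurrency grows (multi-clustering, Enterprise edition and higher).
    ALTER WAREHOUSE reporting_wh SET
      MIN_CLUSTER_COUNT = 1
      MAX_CLUSTER_COUNT = 10;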

When compared to Redshift, Snowflake does simplify both scalability and administration. But many of the same challenges remain, from batch-centric ingestion to limited tuning options.

2nd generation query engines - Presto and Amazon Athena

Presto was originally developed by Facebook to query data in Hadoop. Facebook needed a query engine that was faster than Hive and could operate across Hadoop and other data. The result was one of the first high-performance federated query engines that separated storage and compute and could query existing data “as is”. Since Facebook contributed it to open source, Presto has become one of the most popular open source SQL query engines on top of data lakes or simple storage.

Amazon Athena takes Presto as its foundation, and turns it into a serverless SQL query cloud service. It is a multi-tenant service of pooled resources that handles all requests across all customers. The data is stored in S3. Presto (and Athena) acts as a federated query engine where queries are broken down into tasks executed in a cluster or pushed down to target databases. Athena is also the foundation for Redshift Spectrum.

One of Presto’s key strengths, the ability to federate queries across different data sources, is also one of its weaknesses when it comes to improving performance or simplifying administration. You need to implement your own data pipeline independently of Presto, and then configure Presto to work against the data. While Presto can take advantage of optimized storage formats such as Parquet, this does not happen out of the box; the sketch below shows what those manual steps look like in Athena. You can also push down queries to execute locally within databases and data warehouses such as Redshift. But Presto is not optimized for performance with storage, and it does not implement indexing, sorting, materialized views, caching, or other performance-related features.
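For example, here is a minimal, hypothetical Athena sketch. The table name and S3 path are illustrative, and converting the data to partitioned Parquet is a step your own pipeline has to perform first.

    -- Declare a table over data that already sits in S3; Athena does not
    -- manage or optimize the storage itself.
    CREATE EXTERNAL TABLE clicks (
      user_id string,
      url     string,
      ts      timestamp
    )
    PARTITIONED BY (event_date date)
    STORED AS PARQUET
    LOCATION 's3://my-bucket/clicks/';

    -- Register the partitions your pipeline has already written.
    MSCK REPAIR TABLE clicks;

    -- Athena bills per TB scanned, so filtering on the partition column
    -- to prune data matters for both cost and performance.
    SELECT url, COUNT(*) AS views
    FROM clicks
    WHERE event_date = DATE '2021-01-01'
    GROUP BY url;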

3rd generation data warehouses - Firebolt

For most companies, the main reasons for adopting cloud technologies have always been to lower costs and adopt the latest innovations. While 1st generation data warehouses focused on delivering an easier-to-deploy cloud service, and the 2nd generation focused on elastic scalability and ease of administration, 3rd generation data warehouses use the latest innovations in analytics and cloud computing to deliver much faster query performance and a much lower total cost of ownership (TCO) at any scale.

The major limitations to performance lie in the separation of storage and compute, and in query engine design. The first time data is needed, it has to be fetched from remote storage over the network, and in most distributed systems with large data sets the network is the first bottleneck. A 10GigE network can transport roughly 1 GB per second fully loaded, if you are lucky; at that rate a terabyte takes over 16 minutes. Even with a network that is 10-100x faster, queries need to be optimized to pull as little data as possible from cold storage. Most cloud data warehouses today pull much larger sets. Snowflake, for example, prunes micro-partitions by tracking the min-max data ranges in each one, but each micro-partition is 50-150MB.
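As a rough sketch of what that pruning looks like in practice (the table name is illustrative):

    -- Snowflake tracks min/max ranges per micro-partition. Clustering by a
    -- commonly filtered column keeps those ranges narrow, so more
    -- micro-partitions can be skipped at query time.
    ALTER TABLE events CLUSTER BY (event_date);

    -- This query scans only micro-partitions whose event_date range matches,
    -- but every micro-partition it does read pulls 50-150MB from storage.
    SELECT COUNT(*) FROM events WHERE event_date = '2021-01-01';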

The second major performance bottleneck is the query processing itself. For one, most cloud data warehouses do not take advantage of some traditional performance techniques, such as indexing. Several, such as Athena and Snowflake, also do not expose many tuning options. But there are also more recent optimization techniques that have been refined over the last few years, from cost-based optimization to just-in-time (JIT) compilation for distributed queries.

The major bottleneck in cost is pricing, for two reasons. First, every incremental query costs money, and big, complex queries cost a lot of money. In the Fivetran benchmark, which used 1TB of data, most of the clusters cost $16 per hour. That was the Enterprise ($2 per credit) pricing for Snowflake; Business Critical or Virtual Private Snowflake (VPS) editions cost $4 or more per credit. Running such a cluster full-time would be roughly $300,000 per year at list price ($4 per credit at 8 credits per hour is $32 per hour, or about $280,000 across 8,760 hours). This enormous cost forces many IT organizations to throttle Snowflake usage, often down to individual queries, and to prune data. But pruning data and restricting usage goes against the very goal of the democratization of data and analytics.

The second issue with pricing is the pricing structure itself. Redshift and Snowflake, for example, charge for compute with a built-in markup over their own cost for AWS infrastructure. This creates a conflict of interest with customers, because these vendors make money by selling more nodes, and the main way to improve performance is to buy larger clusters. If Redshift or Snowflake improved performance and efficiency on each node, their customers would buy fewer nodes, and the vendors would lose money.

Firebolt is the first 3rd generation cloud data warehouse, adding a number of innovations and changing pricing to improve performance and lower TCO. These innovations include sparse, aggregating, and join indexes; the optimized F3 storage format; vectorized processing with JIT compilation and pushdown optimization; native nested-array storage for semi-structured data; and a choice of AWS instance types.
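As a rough, hypothetical sketch of how two of those innovations surface in Firebolt’s SQL (the table and index names are illustrative):

    -- The primary index is a sparse index: queries read only the data
    -- ranges that match the filter instead of scanning whole partitions.
    CREATE FACT TABLE events (
      event_date  DATE,
      customer_id LONG,
      revenue     DOUBLE
    )
    PRIMARY INDEX event_date, customer_id;

    -- An aggregating index maintains precomputed aggregates so queries
    -- such as totals per day avoid full scans entirely.
    CREATE AGGREGATING INDEX events_by_day ON events (
      event_date,
      SUM(revenue),
      COUNT(*)
    );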

This combination of features has delivered up to 182x faster performance in benchmarks, and has demonstrated sub-second performance at petabyte scale. Firebolt has also shown up to 10x lower cost and a 30x price-performance advantage over Snowflake. One reason is that Firebolt is much more efficient on the same hardware. The other is that Firebolt does not charge a markup on AWS resources, and supports spot instances as well.

Comparison - Redshift, Athena, Snowflake, Firebolt

What follows is a detailed side-by-side comparison of the main options on AWS across the 1st, 2nd, and 3rd generation data warehouses and query engines: Amazon Redshift, Amazon Athena, Snowflake, and Firebolt. The comparison spans more than 30 categories:

(Generations: Redshift - 1st, Athena - 2nd, Snowflake - 2nd, Firebolt - 3rd.)

ARCHITECTURE (overall)

Elasticity - separation of storage and compute
  Redshift:  RA3 and Spectrum only
  Athena:    Yes (query engine, no storage)
  Snowflake: Yes
  Firebolt:  Yes

Supported cloud infrastructure
  Redshift:  AWS only
  Athena:    AWS only
  Snowflake: AWS, Azure, Google Cloud
  Firebolt:  AWS only

Tenancy - option for dedicated resources
  Redshift:  Isolated tenant, use own VPC
  Athena:    Multi-tenant, pooled resources
  Snowflake: Multi-tenant, dedicated resources (only VPS is isolated)
  Firebolt:  Isolated/VPC tenancy for storage and compute

Compute - instance types
  Redshift:  Any cluster size and node type
  Athena:    N/A - no choice of resources
  Snowflake: 1-128 nodes, unknown instance types
  Firebolt:  1-128 nodes, choice of any instance type

Data - internal/external, writable storage
  Redshift:  Internal writable, external (Spectrum)
  Athena:    External only
  Snowflake: Internal writable; external tables supported (slower)
  Firebolt:  Internal writable; external tables (used for ingestion), supports updates

Security - data
  Redshift:  Encryption at rest, RBAC
  Athena:    Shared resources
  Snowflake: Encryption at rest, RBAC, separate customer keys (only VPS is an isolated tenant)
  Firebolt:  Encryption at rest, RBAC

Security - network
  Redshift:  Firewall, SSL, PrivateLink, whitelist/blacklist control, isolated/VPC tenancy
  Athena:    Shared resources
  Snowflake: Firewall, SSL, PrivateLink, whitelist/blacklist control (isolated for VPS only)
  Firebolt:  Firewall and WAF, SSL, PrivateLink, whitelist/blacklist control, isolated tenancy

SCALABILITY

Elasticity - (individual) query scalability
  Redshift:  Manual cluster resize
  Athena:    Automatic (but shared resources)
  Snowflake: 1-click cluster resize, no choice of node size
  Firebolt:  1-click cluster resize, choice of EC2 type and number of nodes

Elasticity - user (concurrent query) scalability
  Redshift:  Autoscale up to 10 clusters; 15 queries per cluster, 50 queued queries total, max
  Athena:    Limited - 20 concurrent queries by default
  Snowflake: Autoscale up to 10 warehouses; limited to 20 DML writes in queue per table
  Firebolt:  Unlimited manual scaling

Write scalability - batch
  Redshift:  1 master cluster
  Athena:    N/A (storage and ingestion are separate)
  Snowflake: Strong
  Firebolt:  Strong - multi-master parallel batch

Write scalability - continuous
  Redshift:  Limited (table-level locking)
  Athena:    N/A (mostly batch-centric storage)
  Snowflake: Limited to 20 DML writes in queue per table
  Firebolt:  Multi-master continuous updates

Upserts
  Redshift:  Yes (limited)
  Athena:    N/A (slow with columnar storage)
  Snowflake: Yes (slow - micro-partition rewrites)
  Firebolt:  Yes (fast)

Data scalability
  Redshift:  Only with RA3: 128 RA3 nodes, 8PB of data
  Athena:    Up to 100 partitions per table, 100 buckets (default)
  Snowflake: No specified storage limit; 4XL warehouse (128 nodes); 16MB max field size
  Firebolt:  No limit

PERFORMANCE

Query optimization
  Redshift:  Limited cost-based optimization
  Athena:    Limited cost-based optimization
  Snowflake: Cost-based optimization, vectorization
  Firebolt:  Index- and cost-based optimization, vectorization, JIT compilation, pushdown optimization

Tuning - CPU, RAM, SSD, storage
  Redshift:  Choice of (limited) instance types
  Athena:    No choice of resources
  Snowflake: Can only choose warehouse size, not node types
  Firebolt:  Choice of AWS instance types (CPU/RAM/SSD), tuning options

Ingestion performance
  Redshift:  Batch-centric (table-level locking)
  Athena:    N/A (storage and ingestion are separate)
  Snowflake: Batch-centric (micro-partition-level locking, limit of 20 queued writes per table)
  Firebolt:  Multi-master, lock-free, high-performance ingestion with unlimited scale for batch or continuous ingestion

Ingestion latency
  Redshift:  Batch-centric (table-level locking)
  Athena:    N/A (storage and ingestion are separate)
  Snowflake: Batch writes preferred (1+ minute intervals); requires rewriting entire micro-partitions
  Firebolt:  Immediately visible during ingestion

Indexes
  Redshift:  None
  Athena:    None
  Snowflake: None
  Firebolt:  Indexes for data access, joins, and aggregations (sparse, aggregating, and join indexes)

Storage format
  Redshift:  Native Redshift storage (not Spectrum)
  Athena:    S3 (supports external formats)
  Snowflake: Optimized micro-partition storage (on S3), separate RAM
  Firebolt:  Optimized F3 storage (on S3); data access integrated across disk, SSD, and RAM

Partitioning
  Redshift:  Distribution and sort keys
  Athena:    Partition pruning
  Snowflake: Micro-partition pruning, cluster keys
  Firebolt:  Sparse indexing

Caching
  Redshift:  Result cache
  Athena:    None
  Snowflake: Result cache, materialized views
  Firebolt:  F3 (cache), aggregating and join indexes

Semi-structured data - native JSON functions within SQL
  Redshift:  Limited
  Athena:    No
  Snowflake: Yes
  Firebolt:  Yes

Semi-structured data - native JSON storage type
  Redshift:  No
  Athena:    No
  Snowflake: Limited - VARIANT (single field)
  Firebolt:  Yes (nested array type, compact)

Semi-structured data - performance
  Redshift:  Slow (flattens JSON into tables)
  Athena:    Slow (full load into RAM, full scan)
  Snowflake: Slow (full load into RAM, full scan)
  Firebolt:  Fast (array operations without full scans)

USE CASES

Reporting
  Redshift:  Yes
  Athena:    Yes
  Snowflake: Yes
  Firebolt:  Yes

Dashboards
  Redshift:  Fixed views
  Athena:    Fixed views
  Snowflake: Fixed views
  Firebolt:  Fixed views, dynamic and changing data

Ad hoc
  Redshift:  Seconds-to-minutes first-time query performance
  Athena:    Seconds-to-minutes first-time query performance
  Snowflake: Seconds-to-minutes first-time query performance
  Firebolt:  Sub-second first-time query performance

Operational or customer-facing analytics (high concurrency, continuously updating / streaming data)
  Redshift:  Limited continuous writes and concurrency, limited semi-structured data support
  Athena:    Limited continuous writes and concurrency, slow semi-structured data performance
  Snowflake: Limited continuous writes and concurrency, slow semi-structured data performance
  Firebolt:  Yes - continuous updates at scale, Kafka connector, fast semi-structured data performance

Data processing engine (exports or publishes data)
  Redshift:  Unload data as Parquet
  Athena:    Export query results
  Snowflake: Export query results or tables
  Firebolt:  Export query results

Data science/ML
  Redshift:  Invoke ML (SageMaker) in SQL
  Athena:    Export query results
  Snowflake: Spark, Arrow, and Python connectors, integration with ML tools, export query results
  Firebolt:  Export query results

ADMINISTRATION

Administration - deployment, management
  Redshift:  Easy to provision, harder to configure and manage
  Athena:    No administration or tuning
  Snowflake: Easy to deploy and resize; strong performance visibility, limited tuning
  Firebolt:  Easy to deploy and resize; easy to add indexes and change instance types

Choice - provision different cluster types on same data
  Redshift:  No
  Athena:    No
  Snowflake: Yes - provision different virtual warehouses
  Firebolt:  Yes - provision different engines, sizes, and instance types, with thread-level optimization

Choice - provision different number of nodes
  Redshift:  Yes
  Athena:    No
  Snowflake: Yes (fixed warehouse sizes, 1-128 nodes)
  Firebolt:  Yes

Choice - provision different instance types
  Redshift:  Yes (limited)
  Athena:    No
  Snowflake: No
  Firebolt:  Yes

COST

Pricing - compute
  Redshift:  Markup on AWS nodes; $0.25-$13 per node per hour on demand, up to 40% less for upfront commitments
  Athena:    $5 per TB scanned (10MB minimum per query)
  Snowflake: Markup on AWS nodes; $2-$4+ per node per hour (credit)
  Firebolt:  No markup; pay AWS list price

Pricing - storage
  Redshift:  Stored data only. RA3: $24 per TB per month. Spectrum: AWS S3 costs plus $5 per TB scanned (10MB minimum per query)
  Athena:    N/A (S3 storage is not part of Athena)
  Snowflake: All storage: $23/$40 per TB per month (on demand/upfront)
  Firebolt:  Stored data only. 1TB: free; 1-5TB: $25K; 5-200TB: $80K; 200TB+: ask for a quote

Pricing - transferred data
  Redshift:  None
  Athena:    None
  Snowflake: None
  Firebolt:  None

Summary

The detailed comparison of Redshift, Athena, Snowflake, and Firebolt across architecture, scalability, performance, use cases, and cost of ownership highlights major differences in where each option fits best.

In short, different data warehouses are better suited to different use cases. This information will hopefully help you choose the right data warehouse or query engine for each use case. It may also help you prepare your analytics infrastructure so that these choices remain open.

There has never been a single data warehouse that satisfied all the analytics needs of a company. The combination of today’s modern data pipelines and data lakes with the simplicity of cloud services has made adding another cloud data warehouse relatively straightforward and cost effective. Companies that were already using a cloud data warehouse for reporting and dashboards have been able to add Firebolt as another cloud data warehouse in weeks, and use it for ad hoc, semi-structured, operational, and customer-facing analytics, while leaving the existing analytics in place.

If you put a modern data pipeline in place that makes it easy to redirect your data from a data lake or other “single source of truth”, you will be able to choose the best combination of price and performance for each analytics need.