WHITEPAPER

A Comparison of Data Warehouses and Query Engines

Table of contents

1. Introduction
2. The evolution of data warehousing - from on premises to the cloud
3. Comparison table - Redshift, Athena, Snowflake, Firebolt
4. Summary
5. About Firebolt

Introduction

Choosing the right cloud data warehouse and analytics infrastructure can be confusing. Over the last decade, the number of options have exploded:

Making the right choice partly requires a deeper understanding into how these data warehouses and query engines compare. But it also requires a deeper understanding of what your own analytics needs are today, and what they will be over the next few years.  This evolution in cloud data warehousing has been driven by major changes in analytics needs. Today most companies have the following types of analytics:

This whitepaper compares the major cloud data warehouse and query engine options across more than 30 different categories including their overall architectures and core features, scalability, performance, and cost, as well as their suitability across these different analytics use cases.

The evolution of data warehousing - from on premises to the cloud

In the beginning - the challenges of on premises data warehouses

Before big data, analytics centered around on-premises data warehouse leaders, including Oracle, Teradata, Netezza (IBM), Vertica (HP), ParAccel (Actian) and GreenPlum (VMWare). Management and business intelligence (BI) teams relied on batch loads from various applications to build reports or populate dashboards for decision making. By 2010, many deployments were starting to hit their limits for several reasons:

Hadoop gained popularity as a solution for storing and analyzing massive amounts of data. But it was batch-based and complicated to manage.  It could not support live, continuous ingestion or fast query performance, or ad hoc analytics. In the meantime, companies also started to look to the cloud to help lower costs.

1st generation cloud data warehouses - Amazon Redshift

Amazon Redshift was one of the first, and originally the most successful data warehouse as a cloud service. It is a cloud service that was designed to shift “Red” (Oracle) and other data warehouse deployments to the cloud. Amazon accomplished this by building Redshift using a version of ParAccel, which is based on PostgreSQL.

While Redshift has definitely adopted many of the advantages of cloud services such as simplified management, it has also kept several of the limitations of traditional data warehouses:

2nd generation cloud data warehouses - Snowflake, BigQuery

Snowflake and Google BigQuery were two of the first cloud data warehouses to separate storage and compute, and add elastic scalability.

Snowflake

Snowflake was the first modern cloud data warehouse with a decoupled storage and compute architecture. It provides you with dedicated compute resources that run within Snowflake’s cloud environment. The core computing unit is a virtual data warehouse (or warehouse). Once you provision a warehouse of 1, 2, 4, 8, 16, 32, 64, or 128 nodes (or servers), and assign users to it, the warehouse pulls data from storage as needed by the queries and stores it on local storage, which it refers to as a cache. To scale queries, you can manually provision a new, larger warehouse in seconds. To scale users, a multi-clustering option, only available with the Enterprise edition and higher, can automatically provision and deprovision up to 10 identical warehouses and assign users across them.

When compared to Redshift, Snowflake does simplify both scalability and administration. But many of the same challenges still remain:

BigQuery

Google BigQuery was released before Snowflake as a serverless query engine with decoupled storage and compute. So why not call it the first modern cloud data warehouse? The short answer is that like other great technologies out of Google, it is a unique piece of engineering that is not a typical data warehouse. It was a serverless on demand query engine that also became a data warehouse over time. That is both one of its biggest strengths and weaknesses.

BigQuery started as a public implementation of several internal technologies, which you can read about in the official BigQuery whitepaper. The distributed query engine is based on Dremel. BigQuery uses a columnar store file system called Colossus that stores data in Capacitor with a format that allows BigQuery to operate directly on compressed data. Dremel distributes queries to an execution tree of leaves (slots) and a hierarchy of mixers similar to MapReduce that aggregate results. For each query, Borg, which is Google’s cluster management system, dynamically provisions the server resources (slots). The compute relies on Jupiter, a petabit (yes 1 million gigabit) network for accessing all data. As each task completes the slot, it moves the results into shared memory for the next task (e.g. a Mixer) to pick it up. BigQuery will shuffle to repartition data for the next steps similar to MapReduce. If data outgrows memory or becomes too expensive, it can shuffle to disk (Colossus.)

Today you can run BigQuery on demand, with reserved slots that you run 24x7, or with flex slots that you can run for seconds. On demand, which costs $5/TB of uncompressed data scanned, is great for running the occasional analytics or for queries where you’re not scanning a lot. But on demand can quickly cost a fortune if you run queries regularly, which is why most data warehouse customers now buy reserved slots at $2000 a month per 100 or $1700 a month for an annual commitment. More recently, Google also released flex slots, which are priced at $4/hour for 100 slots (a 44% premium over monthly reserved.) The combination of reserved and flex slots makes BigQuery price-competitive with Snowflake.

Unlike Snowflake, BigQuery supports streaming ingestion at scale. They accomplish this with an architecture similar to Vertica that is also used by a few other specialized engines. They ingest one message (row) at a time into row-based memory and then eventually commit to columnar storage.

BigQuery, like Snowflake, supports any scale and has even less administrative overhead as a serverless query engine. Despite the impressive engineering feats as a dynamically provisioned query engine, BigQuery still has several challenges as a data warehouse:

2nd Generation query engines - Presto and Amazon Athena

Presto was originally developed by Facebook to query data in Hadoop. Facebook needed a query engine that was faster than Hive and could operate across Hadoop and other data. They ended up creating one of the first high performance federated query engines that separated storage and compute, and was able to query existing data “as is”. Since Facebook contributed it to open source, Presto has become one of the most popular open source SQL query engines on top of data lakes or simple storage. 

Amazon Athena takes Presto as its foundation, and turns it into a serverless SQL query cloud service. It is a multi-tenant service of pooled resources that handles all requests across all customers. The data is stored in S3. Presto (and Athena) acts as a federated query engine where queries are broken down into tasks executed in a cluster or pushed down to target databases. Redshift Spectrum, while implemented differently from Athena, gives Redshift customers integrated federated querying.

One of Presto’s key strengths, the ability to federate queries across different data sources, is also one of its weaknesses when it comes to improving performance or simplifying administration. You need to implement your own data pipeline independently of Presto, and then configure Presto to work against the data. While Presto can take advantage of optimized storage formats such as Parquet, it is not out of the box. You can also push down queries to execute locally within databases and data warehouses such as Redshift. But Presto is not optimized for performance with storage. It also does not implement indexing, sorting, materialized views, caching, or other performance-related features. 

3rd generation data warehouses - Firebolt

For most companies, the main reasons for adopting cloud technologies have always been to lower costs and adopt the latest innovations. While 1st generation data warehouses focused on delivering an easier-to-deploy cloud service, and the 2nd generation focused elastic scalability and ease of administration, 3rd generation data warehouses are focused on using some of the latest innovations in analytics and cloud over the last few years to deliver much faster query performance and a much lower total cost of ownership (TCO) at any scale.

The major limitations to performance are in the separation of storage and compute, and query engine design. The first time data is needed, it has to be fetched from remote storage over the network, and in most distributed systems with large data sets, the network is the first bottleneck. A 10GigE network can transport roughly 1 GB per second fully loaded, if you are lucky. A terabyte will take over 16 minutes. Even with a network that is 10-100x faster, queries need to be optimized to pull as little data as possible from cold storage. Most cloud data warehouses today pull much larger sets. Snowflake, for example, will prune out micro-partitions by keeping track of the min-max data ranges in each micro-partition. But each micro-partition is 50-150MB.

The second major performance bottleneck is the query processing itself.  For one, most cloud data warehouses do not take advantage of some traditional performance techniques, such as indexing. Several data warehouses such as Athena and Snowflake also do not expose a lot of tuning options. But there are also some more recent optimization techniques that have been refined in the last few years, from cost-based optimization to just-in-time (JIT) compilation for distributed queries.

The major bottleneck in cost is pricing. Every incremental query costs money, and big, complex queries cost a lot of money. If you look at the Fivetran benchmark, which managed 1TB of data most of the clusters cost $16 per hour. That was the enterprise ($2 per credit) pricing for Snowflake.  Running business-critical or Virtual Private Snowflake (VPS) would be $4 or more per credit. Running it full-time with 1TB of data would be roughly $300,000 per year at list price. This enormous cost forces many IT organizations to throttle Snowflake usage, often down to individual queries, and prune data. But pruning data and restricting usage goes against the very goal of the democratization of data and analytics.

Firebolt is the first 3rd generation cloud data warehouse that has added a number of innovations to improve performance and lower TCO. Some of these innovations include: 

The combination of features has delivered 4-6000x faster performance in customer banchmarks, across queries and sub-second performance. Note in these customer benchmarks that while price-performance is 10x or greater, the total cost savings are less, only up to 10x lower at most. This is because most customers improve performance 4-6000x across queries compared to the previous solution.

Comparison: Redshift vs Athena vs BigQuery vs Snowflake vs Firebolt

What follows is a detailed side-by-side comparison of the main options across the 1st, 2nd, and 3rd generation data warehouses and query engines: Amazon Redshift, Amazon Athena, Snowflake and Firebolt. The comparison spans more than 30 categories, including:

1st generation
Scalability
Architecture
2nd generation
2nd generation
3rd generation
2nd generation
Elasticity - user (concurrent query) scalability
Elasticity - (individual) query scalability
Data - internal/external, writable storage
Isolated tenancy - option for dedicated resources
Compute - node types
Supported cloud infrastructure
Elasticity - separation of storage and compute
Security - data
Security - network
Write scalability - continuous
Write scalability - batch
Data scalability
Performance
Storage format
Tuning
Query optimization - performance
Indexes
Ingestion performance
Ingestion latency
Semi-structured data - native JSON storage type
Semi-structured data - native JSON functions within SQL
Caching
Partitioning
Semi-structured data - performance
Use cases
COST
Dashboards
Reporting
Administration - deployment, management
Choice - provision different number of nodes
Choice - provision different node types
Pricing - compute
Pricing - storage
Pricing - transferred data
Ad hoc, interactive analytics
Invoke ML (SageMaker) in SQL
Unload data as Parquet
Slower query performance. Limited continuous writes and concurrency, limited semi- structured data support
Slow query performanceand limited scale. Limited continuous writes and concurrency, slow semi- structured data performance
Slower query performance. Limited to 100K continuous writes/table, 100 concurrent users by default*
Slower query performance. Limited continuous writes and concurrency, slow semi- structured data performance
Yes. Support continuous writes at scale, fast semi-structured performance
Exports query results
1GB max export file size, exports to Google cloud only*
Export query results or table
Export query results
Data science/ML
Data processing engine (Exports or publishes data)
Operational or customer-facing analytics (high concurrency, continuously updating / streaming data)
Export query results
BigQuery ML
Spark, Arrow, Python connectors, integration with ML tools, export query results
Export query results
Choice - provision different cluster types on same data
1 master cluster
N/A
1,500 load jobs/day (~1 per minute), 100,000 per project, 15TB per job, 6 hour max time*
Strong
Strong. Multi-master parallel batch
Autoscale up to 10 clusters, 15 queries per cluster, 50 queued queries total, max.
Manual
Automatic (but shared resources)
Automatic allocation of each query to on demand, or reserved and flex slots
1-click cluster resize, no choice of node size
1-click cluster resize of node type, number of nodes
Internal writable, external (Spectrum)
Multi-tenant pooled resources
Multi-tenant on demand and reserved resources only
Multi-tenant resources (Only VPS dedicated on isolated Snowflake account)
Multi-tenant dedicated resources
Isolated tenant, use own VPC
Any size, and types
N/A no choice
No choice over (fixed) slot size
1-128 nodes, unknown types
1-128 choice of any types
AWS only
Redshift RA3, Spectrum only
Yes (query engine, no storage)
Yes
Yes
Yes
AWS only
Google Cloud only
AWS, Azure, Google Cloud
AWS only
Isolated/VPC tenant for storage and compute
Encryption at rest, RBAC
Shared resources
Separate customer keys, column-level encryption, encryption at rest, AEAD individual value encryption, RBAC
Dedicated resources for storage and compute
Encryption at rest, RBAC
Separate customer keys (only VPS is isolated tenant) Encryption at rest, RBAC
Limited (table-level locking)
Limited - 20 concurrent queries by default
Limited to 100 concurrent users by default*
External only
External tables supported - 4 concurrent queries by default*
External tables supported
Firewall, SSL, PrivateLink whitelist/ blacklist control, isolated/VPC tenancy
Shared resources
TLS, Firewall (Google Cloud), TLS, VPN, whitelist/ blacklist control part of GCP
Firewall, SSL, PrivateLink whitelist/ blacklist control, isolated for VPS only
Firewall and WAF, SSL, PrivateLink whitelist/ blacklist control
External tables (used for ingestion)
N/A (mostly batch-centric storage)
1GB/sec w/ no dedup, 100MB./sec w/ dedup, 100K rows per second per table, 100K-500K per project by default*
Autoscale up to 10 warehouses. Limited to 20 DML writes in queue per table
Limited to 20 DML writes in queue per table. 1 minute or greater ingestion latency recommended
Unlimited manual scaling
Multi-master continuous writes
Only with RA3: 128 RA3 nodes, 8PB of data.
Up to 100 partitions per table, 100 buckets (default)
No real limit
No specified storage limit. 4XL data warehouse (128 nodes)
No limit
Native Redshift storage (not Spectrum)
Choice of (limited) node types
Limited cost- based optimization
None
None
None
None
Limited cost- based optimization
Cost-based optimization
Cost-based optimization, vectorization
Index- and
cost-based optimization, vectorization JIT, pushdown optimization
Indices for data access, joins, aggregation, search
No choice of resources
Can only purchase reserved or flex slots
Can only choose warehouse size not node types
Choice of node types, Indexing
S3
Optimized (capacitor) on Colossus
Optimized micro-partition storage (S3), separate RAM
Optimized F3 storage (on S3). Data access integrated across disk, SSD, RAM)
Batch-centric (table-level locking)
N/A (storage and ingestion separate)
Writes 1 row at a time. Limits of 100K messages/sec by default*
Not well suited for low latency visibility since unable to see as new values during ingestion
Immediately visible during ingestion
No
Can store as strings or STRUCT. But requires UDFs for compute
Slow (full load into RAM, full scan)
JSON strings slow. Can store as STRUCT and use UDF (JS)
Slow (can require full load into RAM, full scan)
Fast (array operations without full scans)
Fixed view
Fixed view
Fixed view
Fixed view, dynamic /
changing data
Seconds-minutes first-time query performance
Seconds-minutes first-time query performance
Seconds-minutes first-time query performance
Sub-second to seconds first-time query performance
No administration or tuning
No administration or tuning
Easy to deploy and resize
Strong performance visibility limited tuning
Easy to deploy and resize
Easy to add indexing, change node types
No
No
Yes
Choice of fixed size warehouses
Choice of node types, engine sizes
Batch-centric (micro-partition level locking, limit of 20 queued writes per table)
Batch write preferred (1+ minute interval). Requires rewrite of entire micro-partition
Immediately visible during ingestion
Limited VARIANT (single field)
Yes (Nested array type, compact)
Multi-master, lock-free high performance ingestion with unlimited scale for batch or continuous ingestion
Batch-centric (minute-level)
No
Limited
Result cache
Distribution, sort keys
Partition pruning
Partitions, pruning
Micro-partition / pruning, cluster keys
F3 with sparse indexing
None
Result cache (24 hours), shared memory
Result cache, materialized view
F3 (cache) with aggregating, join, and search indexes
Yes (Lambda)
Yes
Yes
Yes (Lambda)
Slow (flattens JSON into table)
Fixed view
Yes
Yes
Yes
Yes
Yes
Seconds-minutes first-time query performance
Easy to provision, harder to configure and manage
Yes
Yes
(Limited)
$0.25-13 per node on demand
<40% less when up front
Stored data only.
RA3: $24 per TB per monthS3: AWS S3 costs
Spectrum: $5 per TB scanned (10MB min per query)
$5 per TB scanned (10MB per query)
Batch is free.Streaming ingest $0.01 per 200MB ($50/TB), streaming reads $1.1/TB
None
None
N/A
(not part of Athena)
$20/TB active storage, $10/TB inactive
All storage.
$23/40 per TB per month on demand/up front
All storage.
$23/40 per TB per month on demand/up front
None
On demand - $5/TB data processed, Flex slots $4 for 100 slots per hour
$1700/month per 100 slots
$2-$4+ per node. Fast analytics need large cluster ($16-32+/hour) or greater
Choose any any node. Compute costs range $1-10/hour with 10x price- performance advantage
No
No
No
Yes
No
Up to 2000 flex (on demand) slots, Purchase reserved or flex slots 100 at a time with no limits
Yes
Yes
Redshift
1st generation
ARCHITECTURE (overall)
Athena
2nd generation

Firebolt
3rd generation
Snowflake
2nd generation
Elasticity - separation of storage and compute
Supported cloud infrastructure
Elasticity - separation of storage and compute
Control - provision different cluster types on same data
Performance
Control - provision different number of nodes
Control - provision different instance types
Write scalability - batch
Write scalability - continuous
Upserts
Use cases
Price
(keep high level)
Cloud
Semi-structured data - native JSON functions within SQL
Semi-structured data - native JSON storage type
Semi-structured data - performance
Data scalability
Indexes
Sorting (REMOVE FOR NOW)
Query optimization - performance
Query prioritization -
Predictability
(HOLD BACK ON ROW FOR A FEW WEEKS)
Storage format
Partitioning
Caching
Reporting
Dashboards
Ad hoc
Operational or customer-facing analytics (high concurrency, continuously updating / streaming data)
Data processing engine (Exports or publishes data)
Data science/ML
Compute
Storage
Transferred data
Supported cloud infrastructure
Tenancy - option for dedicated resources
Only with RA3: 128 RA3 nodes, 8PB of data.
Up to 100 partitions per table, 100 buckets (default)
No specified storage limit. 4XL data warehouse (128 nodes). 16MB max field size
No limit
AWS only
Limited - 20 concurrent queries by default
Autoscale up to 10 warehouses. Limited to 20 DML writes in queue per table.
Unlimited manual scaling
Redshift RA3, Spectrum only
Yes (query engine, no storage)
Yes
Yes
Redshift RA3, Spectrum only
No
Yes (query engine, no storage)
No
Yes
Yes
Yes
Yes
Yes
No
Yes
Yes
Yes
(Limited)
No
N/A
N/A (mostly batch-centric storage)
Yes (Lambda)
Yes
Yes (Lambda)
None.
None.
Sparse, aggregate, join indexes
None
Provision different virtual warehouses
Provision different engines, sizes, instance types. Thread-level optimization
N/A. Supports external formats
Optimized storage (S3)
Optimized F3 from storage to RAM
No
Strong
Strong. Multi-master parallel batch
Limited to 20 DML writes in queue per table
Multi-master continuous updates
Yes
1 master cluster
Limited (table-level locking)
Yes (limited)
N/A (slow with columnar)
Yes (slow) micro-partition rewrites
Yes (fast)
Limited
No
Slow (flattens JSON into table)
Slow (full load into RAM, full scan)
Slow (full load into RAM, full scan)
Fast (array operations without full scans)
No
Limited VARIANT (single field)
Yes (Nested array type, compact)
None.
Yes, and materialized views.
Result cache
Yes
Fixed view
Sec-min first-time query performance
Limited continuous writes. Limited concurrency
Unload data as Parquet
Invoke ML (SageMaker) in SQL
Markup on AWS nodes
$0.25-13 per node on demand
Stored data only.
RA3: $24 per TB per monthS3: AWS S3 costs.
Spectrum: $5 per TB scanned (10MB min per query)
AWS only
Isolated or multi-tenant (including Spectrum)
Multi-tenant / pooled resources
Multi-tenant dedicated resources (Only VPS isolated)
Isolated tenancy Compute and storage
AWS only
AWS, Azure, Google Cloud
AWS only
$5 per TB scanned (10MB per query)
None.
None.
Should we be aggressive on Azure, GCP
N/A
(not part of Athena)
All storage.
$23/40 per TB per month on demand/up front
Stored data only.
1TB: free1-5TB: $25K 5-200TB: $80K
200TB+: ask for quote
None.
Markup on AWS nodes $2-$4+ per node
No markup Pay AWS list price
Export query results
Spark, Arrow, Python connectors, integration with ML tools, export query results
Export query results
Exports query results
Export query results or table
Export query results
Limited continuous writes. Limited concurrency
Limited continuous writes. Limited concurrency
Yes. Support continuous updates, Kafka connector
Sec-min first-time query performance
Sec-min first-time query performance
Sub-second first-time query performance
Fixed view
Fixed view
Fixed view, dynamic,
Changing data
Yes
Yes
Yes
None
Result cache, materialized view
F3 (cache) (Aggregate index, join index)
Limited Cost- based optimization
Limited cost- based optimization
Cost-based optimization, vectorization
Index- and cost-based optimization,vectorization JIT, pushdown optimization
Yes.
Yes, and materialized views
Yes, and Aggregate indexes
Workload management (WLM), Short Query Acceleration (SQA)
Native Redshift storage (not Spectrum)
Distribution, sort keys
Partition pruning
Micro-partition / pruning, cluster keys
Sparse indexing

*While Google BigQuery publishes official quotas and other limits, in fact you can ask for them to be raised when you have reserved slots. None appear to be hard technical limits.

Summary

The detailed comparison of Redshift, Athena, Snowflake, BigQuery, and Firebolt across architecture, scalability, performance, use cases and cost of ownership highlights the following major differences:

In short, some data warehouses are better for different use cases. Using this information will hopefully help you choose the right data warehouse or query engine for different use cases. It may also help you prepare your analytics infrastructure so that these choices can be made. 

The combination of today’s modern data pipelines and data lakes with the simplicity of cloud services have made adding another cloud warehouse relatively straightforward and cost effective. Those companies that were already using a cloud data warehouse for reporting and dashboards were able to add Firebolt as another cloud data warehouse in weeks, and use it for ad hoc, semi-structured data, operational and customer-facing analytics, while leaving the existing analytics in place.

If you do put a modern data pipeline in place that does make it easier to redirect your data from a data lake or other “single source of the truth” then you will be able to choose the best combination of price and performance for each analytics need.

About Firebolt

Firebolt is the world’s fastest cloud data warehouse, purpose-built for delivering a new grade of analytic experiences over big data. It provides tech companies with a modern data stack, and orders of magnitude faster query performance at a fraction of the cost of the alternatives by combining the simplicity, elasticity and low cost of the cloud with the latest innovations in analytics. Engineering teams that adopted Firebolt have been able to deploy data applications and interactive analytics over big data in weeks while delivering sub-second performance at terabyte to petabyte scale.

Firebolt is founded by members of the original team at Sisense, A global BI leader, and backed by Zeev Ventures, TLV Partners, Bessemer Venture Partners and Angular Ventures.

Talk to a Firebolt solution architect