Snowflake vs Athena

A detailed comparison

Compare Snowflake vs Athena by the following set of categories:

Snowflake Amazon Athena
Elasticity - separation of storage and compute Yes Yes (query engine, no storage)
Supported cloud infrastructure AWS, Azure, Google Cloud AWS only
Tenancy - option for dedicated resources Multi-tenant dedicated resources (Only VPS isolated) Multi-tenant pooled resources
Compute - instance types 1-128 nodes, unknown types N/A no choice
Data - internal/external, writable storage External tables supported External only
Security - data Separate customer keys (only VPS is isolated tenant,Encryption at rest, RBAC Shared resources
Security - network Firewall, SSL, PrivateLink whitelist/blacklist control, isolated for VPS only Shared resources


Snowflake was one of the first decoupled storage and compute architectures, making it the first to have nearly unlimited compute scale and workload isolation, and horizontal user scalability. It runs on AWS, Azure and GCP, and while by default it is multi-tenant compute and data, it can run in a Snowflake VPC. But it only provides 1, 2, 4, … 128 node clusters with no choice of node sizes. To get the biggest nodes you need to choose the biggest cluster.

Athena is built on a decoupled storage and compute architecture, though it only provides and controls the compute part and does not manage ingestion or storage. It is also only on multi-tenant shared resources. If you are a Redshift customer you can use Redshift Spectrum, which is dedicated Athena deployed on up to 10x the number of Redshift nodes in your own VPC, for the same price as Athena.

Snowflake vs Athena - Architecture

The biggest difference among cloud data warehouses are whether they separate storage and compute, how much they isolate data and compute, and what clouds they can run on.

Snowflake Amazon Athena
Elasticity - (individual) query scalability 1-click cluster resize, no choice of node size Automatic (but shared resources)
Elasticity - user (concurrent query) scalability Autoscale up to 10 warehouses. Limited to 20 DML writes in queue per table Limited - 20 concurrent queries by default
Write scalability - batch Strong N/A
Write scalability - continuous Limited to 20 DML writes in queue per table N/A (mostly batch-centric storage)
Upserts Yes (slow) micro-partition rewrites N/A (slow with columnar)
Data scalability No specified storage limit. 4XL data warehouse (128 nodes). 16MB max field size Up to 100 partitions per table, 100 buckets (default)


Snowflake
delivers strong scalability with its decoupled storage and compute architecture. But it is inefficient at scaling for certain queries that require larger nodes, because the only way to get larger nodes is with larger clusters. It is better suited for batch writes and upserts as well because it requires entire micro-partitions to be rewritten for each write. Snowflake also transfers entire micro-partitions over the network, which creates a bottleneck at scale.

Athena is a shared multi-tenant resource, which means each account needs to be throttled to protect every other account’s performance. One customer was unable to handle any table or join above 5 billion rows. By default Athena supports a maximum of 20 concurrent users. If scalability is a top priority, Athena is probably the wrong choice.

Snowflake vs Athena - Scalability

There are three big differences among data warehouses and query engines that limit scalability: decoupled storage and compute, dedicated resources, and continuous ingestion.

Snowflake Amazon Athena
Indexes None. None.
Query optimization - performance Cost-based optimization, vectorization Limited cost- based optimization
Tuning - CPU, RAM, SSD, Storage Can only choose warehouse size not node types No choice of resources
Storage format Optimized micro-partition storage (S3), separate RAM S3
Ingestion performance Batch-centric (micro-partition level locking, limit of 20 queued writes per table) N/A (storage and ingestion separate)
Ingestion latency Batch write preferred (1+ minute interval). Requires rewrite of entire micro-partition N/A (storage and ingestion separate)
Partitioning Micro-partition / pruning, cluster keys Partition pruning
Caching Result cache, materialized view None
Semi-structured data - native JSON functions within SQL Yes Yes (Lambda)
Semi-structured data - native JSON storage type Yes Yes
Semi-structured data - performance Fixed view Fixed view


Snowflake
is a perfect example of a modern storage and compute architecture that is not optimized for performance. Its data access is not optimized. Snowflake has no indexing for fetching the exact data it needs. It only keeps track of data ranges within each micro-partition, which range in size from 50 MB to 150 MB uncompressed, and can overlap. Whenever Snowflake does not have the data cached locally in the virtual warehouse, it has to fetch all of the micro-partitions that might have the data, which can take seconds or longer. While Snowflake does some query plan optimization, it does not show up in the performance of queries, which are 4-6000x slower than Firebolt in customer benchmarks. The two biggest explanations are a lack of indexing, and limited query plan optimization. However, Snowflake does provide result set caching across virtual warehouses in addition to SSD caching within each virtual warehouse. This does deliver solid performance for repetitive query workloads after the first query. 

Athena, and Presto, should be the worst at performance, by design. The reason is that it sacrifices storage-compute optimization to get support for federated queries across multiple data sources. But there is a reason Presto is so popular. Even with that handicap, Presto and Athena do very well. Presto can come close to Redshift and Snowflake in performance when both Presto and the external storage is managed by experts. But there is no support for indexing. Specifically with Athena, you cannot guarantee performance as a shared multi-tenant resource. In general, if performance is a top concern and you can bring data together via a data pipeline and optimize data with compute, then Athena or Presto are not the best choice.

Snowflake vs Athena - Performance

Performance is the biggest challenge with most data warehouses today.
While decoupled storage and compute architectures improved scalability and simplified administration, for most data warehouses it introduced two bottlenecks; storage, and compute. Most modern cloud data warehouses fetch entire partitions over the network instead of just fetching the specific data needed for each query. While many invest in caching, most do not invest heavily in query optimization. Most vendors also have not improved continuous ingestion or semi-structured data analytics performance, both of which are needed for operational and customer-facing use cases.

Snowflake Amazon Athena
Reporting Yes Yes
Dashboards Fixed view Fixed view
Ad hoc Sec-min first-time query performance Sec-min first-time query performance
Operational or customer-facing analytics (high concurrency, continuously updating / streaming data) Limited continuous writes and concurrency, slow semi- structured data performance Limited continuous writes and concurrency, slow semi- structured data perf.
Data processing engine (Exports or publishes data) Export query results or table Exports query results
Data science/ML Spark, Arrow, Python connectors, integration with ML tools, export query results Export query results


Snowflake
has broader support for use cases beyond traditional reporting and dashboards. Its decoupled storage and compute architecture enables you to isolate different workloads to meet SLAs, and it also supports high user concurrency. But Snowflake also does not provide interactive or ad hoc query performance because of inefficient data access along with a lack of extensive indexing and query optimization. Snowflake also cannot support streaming or low latency ingestion below one minute ingestion intervals. All of these limitations exclude Snowflake from many operational use cases and most customer-facing applications that require second-level performance.

Athena is one of the best “one-off” query engines; all you have to do is provide the data and pay $5 a TB. If you need to quickly pull together multiple data sources, it’s a great option. Redshift Spectrum is a great add-on option for Redshift for federated queries. But if you don’t need federated queries, need performance, and need anything other than one-off or occasional analytics, Athena is not a good option for any of these use cases. There is no data, network or query optimization, no indexing beyond pruning indexes like others.

Snowflake vs Athena - Use cases

There are a host of different analytics use cases that can be supported by a data warehouse. Look at your legacy technologies and their workloads, as well as the new possible use cases, and figure out which ones you will need to support in the next few years. They include:

Reporting where relatively static reports are created by analysts against historical data, and used by executives, managers, and now increasingly by employees and customers

Dashboards created by analysts against historical or live data, and used by executives, managers, and increasingly by employees and customers via Web-based applications

Interactive and ad hoc analytics within dashboards or other tools for on-the-fly interactive analysis either by expert analysts, or increasingly by employees and customers via self-service

High performance analytics that require very large or complex queries with sub-second performance.

Big data analytics using semi-structured or unstructured data and complex queries or functionality

Operational and customer-facing analytics built by development teams that deliver historical and live data and analytics to larger groups of employees and customers

Snowflake Amazon Athena
Administration - deployment, management Easy to deploy and resize. Strong performance visibility, limited tuning No administration or tuning
Choice - provision different cluster types on same data Yes No
Choice - provision different number of nodes Yes No
Compute - instance types 1-128 nodes, unknown types Any size, and types
Choice - provision different instance types No No
Pricing - compute $2-$4+ per node. Fast analytics need large cluster ($16-32+/hour) or greater. None
Pricing - storage All storage.$23/40 per TB per month on demand/up front N/A (not part of Athena)
Pricing - transferred data None $5 per TB scanned (10MB per query)


Snowflake
as a more modern cloud data warehouse with decoupled storage and compute is easier to manage for reporting and dashboards, and delivers strong user scalability. It also runs on more than AWS. But like the others, Snowflake does not deliver sub-second performance for ad hoc, interactive analytics at any reasonable scale, or support continuous ingestion well. It is also often very expensive to scale, especially for large data sets, complex queries and semi-structured data.

Athena is arguably the easiest, least expensive and best suited for “one-off analytics”. But it is also the most limited, and requires you to manage your own (external) storage and ingestion very well, which is especially hard for continuous ingestion. This makes the least-suited for any ongoing, frequent use case.

Snowflake vs Athena - Cost

This is perhaps the strangest, and yet the clearest comparison; cost. There are a lot of differences in the details, but at a high level, the main differences should be clear.

Compare other data warehouses

See all data warehouse comparisons ->

Want to hear more about Firebolt? Set a meeting with our solution architects: