BigQuery vs Athena

A detailed comparison

Compare BigQuery vs Athena by the following set of categories:

| Category | BigQuery | Amazon Athena |
| --- | --- | --- |
| Elasticity - separation of storage and compute | Yes | Yes (query engine, no storage) |
| Supported cloud infrastructure | Google Cloud only | AWS only |
| Isolated tenancy - option for dedicated resources | Multi-tenant; on-demand and reserved resources only | Multi-tenant pooled resources |
| Compute - node types | No choice over (fixed) slot size | N/A, no choice |
| Data - internal/external, writable storage | External tables supported - 4 concurrent queries by default* | External only |
| Security - data | Separate customer keys, column-level encryption, encryption at rest, AEAD individual value encryption, RBAC | Shared resources |
| Security - network | TLS, firewall (Google Cloud), VPN, whitelist/blacklist control as part of GCP | Shared resources |

BigQuery was one of the first decoupled storage and compute architectures, released before Snowflake. It is a unique piece of engineering and not a typical data warehouse, in part because it started as an on-demand serverless query engine. While its petabit network dramatically lowers network latency for data access at any given compute step, the additional traffic from transferring and caching intermediate data in shared memory over the network after each slot finishes its job, instead of in a local cache, seems to eliminate any major advantage in actual benchmarks. If BigQuery does start to cache locally on slots, watch out Firebolt: you might have some closer competition.

Athena is built on a decoupled storage and compute architecture, though it only provides and controls the compute part; it does not manage ingestion or storage. It also runs only on multi-tenant shared resources. If you are a Redshift customer you can use Redshift Spectrum instead, which behaves like a dedicated Athena: it is not built on Presto, and it runs on up to 10x the number of your Redshift nodes in your own VPC, for the same price as Athena.

BigQuery vs Athena - Architecture

The biggest differences among cloud data warehouses are whether they separate storage and compute, how much they isolate data and compute, and which clouds they can run on.

| Category | BigQuery | Amazon Athena |
| --- | --- | --- |
| Elasticity - (individual) query scalability | Automatic allocation of each query to on-demand, reserved, or flex slots | Automatic (but shared resources) |
| Elasticity - user (concurrent query) scalability | Limited to 100 concurrent users by default* | Limited - 20 concurrent queries by default |
| Write scalability - batch | 1,500 load jobs/day (~1 per minute), 100,000 per project, 15TB per job, 6 hour max time* | N/A |
| Write scalability - continuous | 1GB/sec with no dedup, 100MB/sec with dedup, 100K rows per second per table, 100K-500K per project by default* | N/A (mostly batch-centric storage) |
| Data scalability | No real limit | Up to 100 partitions per table, 100 buckets (default) |

BigQuery on demand has several official limitations* that exist to protect everyone else on the on-demand tier from a rogue account or query. But you can easily work around most of them by switching to reserved slots and requesting higher limits. BigQuery is in production at very large scale at several companies. Even the limits on message-based ingestion are not an issue; BigQuery ingests into memory first and commits to storage later, which is a better architecture than Snowflake, Redshift, or Athena. Nevertheless, it is still more of a shared service than Snowflake or Redshift, which means it can theoretically hit shared limits.
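One practical consequence of the batch quota above: 1,500 load jobs per table per day works out to roughly one micro-batch per minute. A minimal sketch of that arithmetic, using the quota figures quoted in this comparison (actual quotas may change):

```python
# Sketch: derive the minimum spacing between micro-batch load jobs from
# BigQuery's documented default of 1,500 load jobs per table per day.
SECONDS_PER_DAY = 24 * 60 * 60

def min_batch_interval(jobs_per_day: int = 1500) -> float:
    """Smallest spacing (seconds) between load jobs that stays under quota."""
    return SECONDS_PER_DAY / jobs_per_day

interval = min_batch_interval()
print(f"Load no more than one batch every {interval:.1f} seconds")  # 57.6 s
```

If your pipeline needs fresher data than roughly one load per minute, that is the point at which you switch to streaming inserts (and their own limits) instead.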

Athena is a shared multi-tenant resource, which means each account needs to be throttled to protect every other account’s performance. One customer was unable to handle any table or join above 5 billion rows. By default, Athena supports a maximum of 20 concurrent queries. If scalability is a top priority, Athena is probably the wrong choice.
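In practice, teams work within the 20-concurrent-query default by queueing submissions on the client side rather than letting Athena reject them. A minimal sketch using a semaphore; `submit_query` is a hypothetical stand-in for a real Athena call (e.g. StartQueryExecution via an AWS SDK):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Sketch: cap in-flight queries at Athena's default concurrency limit (20)
# on the client side, so extra submissions wait instead of failing.
ATHENA_CONCURRENCY_LIMIT = 20
_slots = threading.Semaphore(ATHENA_CONCURRENCY_LIMIT)

def submit_query(sql: str) -> str:
    # Placeholder: a real client would call Athena's StartQueryExecution here.
    return f"ran: {sql}"

def run_throttled(sql: str) -> str:
    with _slots:  # blocks while 20 queries are already in flight
        return submit_query(sql)

with ThreadPoolExecutor(max_workers=50) as pool:
    results = list(pool.map(run_throttled, [f"SELECT {i}" for i in range(100)]))
print(len(results))  # 100
```

This does not raise the account limit; it only smooths bursts so that the 21st query waits its turn instead of erroring.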

BigQuery vs Athena - Scalability

There are three big differences among data warehouses and query engines that limit scalability: decoupled storage and compute, dedicated resources, and continuous ingestion.

| Category | BigQuery | Amazon Athena |
| --- | --- | --- |
| Indexes | None | None |
| Query optimization - performance | Cost-based optimization | Limited cost-based optimization |
| Tuning | Can only purchase reserved or flex slots | No choice of resources |
| Storage format | Optimized (Capacitor) on Colossus | S3 |
| Ingestion performance | Writes 1 row at a time; limit of 100K messages/sec by default* | N/A (storage and ingestion separate) |
| Ingestion latency | Immediately visible during ingestion* | Not well suited for low-latency visibility; new values are not visible during ingestion |
| Partitioning | Partitions, pruning | Partition pruning |
| Caching | Result cache (24 hours), shared memory | None |
| Semi-structured data - native JSON functions within SQL | Yes | Yes |
| Semi-structured data - native JSON storage type | Can store as strings or STRUCT, but requires UDFs for compute | No |
| Semi-structured data - performance | JSON strings slow; can store as STRUCT and use UDF (JS) | Slow (full load into RAM, full scan) |

BigQuery has not demonstrated significantly better performance or price-performance compared to Snowflake or Redshift. While remote storage access is much faster using the Jupiter petabit network, the constant writing to and fetching from shared memory over the network for each stage of the query execution (in the DAG) seems to eliminate that advantage. So does the fact that BigQuery does not use indexing. It means slots still have to process all the data stored in larger segments without filtering down to smaller (sorted) ranges. However, BigQuery does have lower latency for message-based ingestion since it does in fact ingest one row at a time and make it immediately available for querying.
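The cost of having no indexes is easy to see in miniature: without data sorted and indexed on the filter key, an engine must touch every row in a segment, while sorted data lets it jump straight to the matching range. A pure-Python illustration (the data and sizes are made up for the example):

```python
import bisect

# Sketch: rows touched by a range filter with and without sorted data.
# Engines without indexes (BigQuery, Athena) scan whole segments and rely
# only on partition pruning; sorted data makes a range lookup cheap.
rows = list(range(1_000_000))  # stand-in column values, already sorted

def full_scan_count(lo: int, hi: int) -> int:
    return sum(1 for v in rows if lo <= v < hi)  # touches every row

def sorted_range_count(lo: int, hi: int) -> int:
    # Two binary searches locate the range without scanning it.
    return bisect.bisect_left(rows, hi) - bisect.bisect_left(rows, lo)

assert full_scan_count(500, 1500) == sorted_range_count(500, 1500) == 1000
```

Both functions return the same answer, but the full scan does a million comparisons to find a thousand rows; this is the work every slot repeats when there is no sorted index to narrow the range.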

Athena, and Presto, should be the worst at performance by design, because they sacrifice storage-compute optimization to support federated queries across multiple data sources. But there is a reason Presto is so popular. Even with that handicap, Presto and Athena do very well. Presto can come close to Redshift and Snowflake in performance when both Presto and the external storage are managed by experts. But there is no support for indexing. Specifically with Athena, you cannot guarantee performance as a shared multi-tenant resource. In general, if performance is a top concern and you can bring data together via a data pipeline and optimize data with compute, then Athena and Presto are not the best choice.

BigQuery vs Athena - Performance

Performance is the biggest challenge with most data warehouses today.
While decoupled storage and compute architectures improved scalability and simplified administration, for most data warehouses they introduced two bottlenecks: storage and compute. Most modern cloud data warehouses fetch entire partitions over the network instead of just fetching the specific data needed for each query. While many invest in caching, most do not invest heavily in query optimization. Most vendors also have not improved continuous ingestion or semi-structured data analytics performance, both of which are needed for operational and customer-facing use cases.
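To make the "entire partitions vs. just the data needed" point concrete, here is a back-of-the-envelope sketch of bytes moved over the network in each case. The partition size and column counts are illustrative assumptions, not measured figures from any vendor:

```python
# Sketch: network bytes moved when an engine fetches whole partition files
# versus only the columns a query actually needs. Sizes are assumptions.
PARTITION_BYTES = 512 * 1024**2  # assume 512 MiB partition files
COLUMNS = 40                     # assume 40 equal-width columns
NEEDED_COLUMNS = 3               # query touches only 3 of them

def whole_partition_fetch(partitions: int) -> int:
    return partitions * PARTITION_BYTES

def column_pruned_fetch(partitions: int) -> int:
    # Equal-width columns for simplicity; real columnar formats vary.
    return partitions * PARTITION_BYTES * NEEDED_COLUMNS // COLUMNS

print(whole_partition_fetch(10) // column_pruned_fetch(10))  # 13
```

Under these assumptions, fetching whole partitions moves roughly 13x more data than fetching only the needed columns, which is the bottleneck the paragraph above describes.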

| Category | BigQuery | Amazon Athena |
| --- | --- | --- |
| Reporting | Yes | Yes |
| Dashboards | Fixed view | Fixed view |
| Ad hoc | Sec-min first-time query performance | Sec-min first-time query performance |
| Operational or customer-facing analytics (high concurrency, continuously updating/streaming data) | Slower query performance; limited to 100K continuous writes/table, 100 concurrent users by default* | Slow query performance and limited scale; limited continuous writes and concurrency, slow semi-structured data performance |
| Data processing engine (exports or publishes data) | 1GB max export file size, exports to Google Cloud only* | Exports query results |
| Data science/ML | BigQuery ML | Export query results |

BigQuery, like Snowflake, has broader support for use cases beyond reporting and dashboards. You can isolate workloads by assigning each workload to different reserved slots. Unlike Snowflake, Redshift, or Athena, BigQuery also supports low latency streaming. But like these other three technologies, BigQuery lacks the performance to support interactive or ad hoc queries at scale. This eliminates BigQuery as a great option for many operational and customer-facing use cases where users demand a few seconds of wait at worst, which translates to sub-second query times for the data warehouse.

Athena is one of the best “one-off” query engines; all you have to do is provide the data and pay $5 a TB. If you need to quickly pull together multiple data sources, it’s a great option. Redshift Spectrum is a great add-on option for Redshift for federated queries. But if you don’t need federated queries, need performance, or need anything other than one-off or occasional analytics, Athena is not a good option. There is no data, network, or query optimization, and no indexing beyond partition pruning.
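Since Athena's economics are pure pay-per-scan, the per-query cost is a one-line calculation from the pricing table below: $5 per TB scanned, with a 10MB minimum charge per query. A small sketch of that arithmetic (decimal TB assumed):

```python
# Sketch: estimate Athena's per-query cost from bytes scanned, using the
# $5/TB rate and 10MB minimum per query cited in this comparison.
PRICE_PER_TB = 5.0
TB = 10**12              # assume decimal terabytes
MIN_BYTES = 10 * 10**6   # 10 MB minimum billed per query

def athena_query_cost(bytes_scanned: int) -> float:
    billable = max(bytes_scanned, MIN_BYTES)
    return billable / TB * PRICE_PER_TB

print(f"${athena_query_cost(2 * TB):.2f}")  # $10.00 for a 2 TB scan
print(f"${athena_query_cost(1_000):.8f}")   # tiny scans still billed at 10 MB
```

This is also why partition pruning matters so much on Athena: cutting bytes scanned is the only lever you have on both performance and cost.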

BigQuery vs Athena - Use cases

There are a host of different analytics use cases that can be supported by a data warehouse. Look at your legacy technologies and their workloads, as well as the new possible use cases, and figure out which ones you will need to support in the next few years. They include:

Reporting where relatively static reports are created by analysts against historical data, and used by executives, managers, and now increasingly by employees and customers

Dashboards created by analysts against historical or live data, and used by executives, managers, and increasingly by employees and customers via Web-based applications

Interactive and ad hoc analytics within dashboards or other tools for on-the-fly interactive analysis either by expert analysts, or increasingly by employees and customers via self-service

High performance analytics that require very large or complex queries with sub-second performance.

Big data analytics using semi-structured or unstructured data and complex queries or functionality

Operational and customer-facing analytics built by development teams that deliver historical and live data and analytics to larger groups of employees and customers

| Category | BigQuery | Amazon Athena |
| --- | --- | --- |
| Administration - deployment, management | No administration or tuning | No administration or tuning |
| Choice - provision different cluster types on same data | Yes | No |
| Choice - provision different number of nodes | Up to 2,000 flex (on-demand) slots; purchase reserved or flex slots 100 at a time with no limits | No |
| Choice - provision different node types | No | No |
| Pricing - compute | On demand: $5/TB data processed; flex slots: $4 per 100 slots per hour; reserved: $1,700/month per 100 slots | None |
| Pricing - storage | $20/TB active storage, $10/TB inactive | N/A (not part of Athena) |
| Pricing - transferred data | Batch is free; streaming ingest $0.01 per 200MB ($50/TB), streaming reads $1.1/TB | $5 per TB scanned (10MB minimum per query) |

BigQuery has three different pricing models: on demand, reserved, and flex pricing. If you need a data warehouse, you probably should not be using on demand unless your queries scan relatively little data. You should be using reserved slots, with flex slots to absorb workload variations. When you do, your costs will not be far off from Snowflake or Redshift for regular data warehouse workloads. BigQuery also gives you the option to support infrequent analytics, more in line with Athena. In other words, it is the best of both more traditional worlds. Nevertheless, BigQuery’s price-performance is in line with Snowflake and Redshift, which is up to 10x more expensive than Firebolt.
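The on-demand vs. slot decision comes down to a break-even point, which you can compute directly from the figures quoted above (on demand $5/TB scanned, flex slots $4 per 100 slots per hour, reserved $1,700 per month per 100 slots). A minimal sketch, pure arithmetic with no API calls:

```python
# Sketch: compare BigQuery pricing models using the rates quoted in this
# comparison. These are list prices as stated here; they may change.
ON_DEMAND_PER_TB = 5.0
FLEX_PER_100_SLOTS_HOUR = 4.0
RESERVED_PER_100_SLOTS_MONTH = 1700.0

def on_demand_cost(tb_scanned: float) -> float:
    return tb_scanned * ON_DEMAND_PER_TB

def flex_cost(slot_hundreds: int, hours: float) -> float:
    return slot_hundreds * FLEX_PER_100_SLOTS_HOUR * hours

def breakeven_tb_per_hour(slot_hundreds: int) -> float:
    """TB/hour of scanning above which flex capacity beats on demand."""
    return flex_cost(slot_hundreds, 1) / ON_DEMAND_PER_TB

print(breakeven_tb_per_hour(1))  # 0.8 TB/hour for one 100-slot block
```

So if a 100-slot block lets you scan more than 0.8 TB per hour, the slot model is already cheaper than on demand, which is why steady data warehouse workloads belong on reserved or flex slots.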

Athena is arguably the easiest, least expensive, and best-suited option for “one-off analytics”. But it is also the most limited, and it requires you to manage your own (external) storage and ingestion very well, which is especially hard for continuous ingestion. This makes Athena the least suited for any ongoing, frequent use case.

BigQuery vs Athena - Cost

This is perhaps the strangest, and yet the clearest, comparison: cost. There are a lot of differences in the details, but at a high level the main differences should be clear.

