Snowflake was one of the first decoupled storage and compute architectures, making it the first to have nearly unlimited compute scale and workload isolation, and horizontal user scalability. It runs on AWS, Azure and GCP, and while by default it is multi-tenant compute and data, it can run in a Snowflake VPC. But it only provides 1, 2, 4, … 128 node clusters with no choice of node sizes. To get the biggest nodes you need to choose the biggest cluster.
BigQuery was one of the first first decoupled storage and compute architectures, released before Snowflake. It is a unique piece of engineering and not a typical data warehouse in part because it started as an on-demand serverless query engine. While its petabit network dramatically lowers network latency for data access for any given compute step, the additional network traffic caused by transferring and caching of data in shared memory over the network after each slot finishes its job instead of in local cache seems to eliminate any major advantage in actual benchmarks. If BigQuery does start to cache locally on slots, watch out Firebolt, you might have some closer competition.
Snowflake delivers strong scalability with its decoupled storage and compute architecture. But it is inefficient at scaling for certain queries that require larger nodes, because the only way to get larger nodes is with larger clusters. It is better suited for batch writes as well because it requires entire micro-partitions to be rewritten for each write. Snowflake also transfers entire micro-partitions over the network, which creates a bottleneck at scale.
BigQuery on demand has several official limitations* that are needed to protect everyone else using on demand from a rogue account or query. But you can easily get around any limitations by switching to reserved slots and requesting higher limits. BigQuery is in production at very large scale with several companies. Even limits with message-based ingestion are not an issue; BigQuery ingests into memory first and later commits to storage, which is a better architecture than Snowflake, Redshift, or Athena. Nevertheless, it is still more of a shared service than Snowflake or Redshift, which means it can theoretically hit shared limits.
Snowflake has of a modern storage and compute architecture that is not completely optimized for performance. Its data access is not optimized. Snowflake has no indexing for fetching the exact data it needs. It only keeps track of data ranges within each micro-partition, which range in size from 50 MB to 150 MB uncompressed, and can overlap. Whenever Snowflake does not have the data cached locally in the virtual warehouse, it has to fetch all of the micro-partitions that might have the data, which can take seconds or longer. While Snowflake does some query plan optimization, it does not show up in the performance of queries, which are 4-6000x slower than Firebolt in customer benchmarks. The three biggest reasons are inefficient data access, a lack of indexing, and less query plan optimization. However, Snowflake does provide result set caching across virtual warehouses in addition to SSD caching within each virtual warehouse. This does deliver solid performance for repetitive query workloads after the first query.
BigQuery has not demonstrated significantly better performance or price-performance compared to Snowflake or Redshift. While remote storage access is much faster using the Jupiter petabit network, the constant writing to and fetching from shared memory over the network for each stage of the query execution (in the DAG) seems to eliminate that advantage. So does the fact that BigQuery does not use indexing. It means slots still have to process all the data stored in larger segments without filtering down to smaller (sorted) ranges. However, BigQuery does have lower latency for message-based ingestion since it does in fact ingest one row at a time and make it immediately available for querying.
Snowflake has broader support for use cases beyond traditional reporting and dashboards. Its decoupled storage and compute architecture enables you to isolate different workloads to meet SLAs, and it also supports high user concurrency. But Snowflake also does not provide interactive or ad hoc query performance because of inefficient data access along with a lack of extensive indexing and query optimization. Snowflake also cannot support streaming or low latency ingestion below one minute ingestion intervals. All of these limitations exclude Snowflake from many operational use cases and most customer-facing applications that require second-level performance.
BigQuery, like Snowflake, has broader support for use cases beyond reporting and dashboards. You can isolate workloads by assigning each workload to different reserved slots. Unlike Snowflake, Redshift, or Athena, BigQuery also supports low latency streaming. But like these other three technologies. BigQuery also lacks the performance to support interactive or ad hoc queries at scale. This eliminates BigQuery from being a great option for many operational and customer-facing use cases where the users demand a few seconds of wait at worst, which translates to sub-second query times for the data warehouse.
Snowflake as a more modern cloud data warehouse with decoupled storage and compute is easier to manage for reporting and dashboards, and delivers strong user scalability. It also runs on more than AWS. But like the others, Snowflake does not deliver sub-second performance for ad hoc, interactive analytics at any reasonable scale, or support continuous ingestion well. It is also often very expensive to scale, especially for large data sets, complex queries and semi-structured data.
BigQuery has three different pricing models: on demand, reserved, and flex pricing. If you need a data warehouse, you probably should not be using on demand unless you do not need to scan a lot of data for each query. You should be using reserved slots with flex slots to reduce the costs of workload variations. When you do, your costs will not be far off from Snowflake or Redshift for regular data warehouse workloads. BigQuery does give you the option to also support infrequent analytics, more inline with Athena. In other words, it is the best of both more traditional worlds. Nevertheless, BigQuery’s price-performance is inline with Snowflake and Redshift, which is up to 10x more expensive than Firebolt.