WHITEPAPER

Evaluating Cloud Data Warehouses:
10 Key Factors to Consider

When selecting a cloud data warehouse, technical and cost constraints often force compromises on certain features. The following checklist of criteria is designed to help you determine which factors matter most for the success of your organization.

1. Performance at Scale

The rapid growth in the diversity and size of data can make querying and ETL/ELT processes tedious and frustrating. When evaluating the performance of cloud data warehouses, consider your use cases. If you’re running customer-facing analytics, performance can make or break the product. Even in internal BI use cases, the ability to run more queries instead of waiting for data, and to have fresh data quickly available for querying, can be crucial for success. Think about it: if it took Google a few minutes (or hours) to return your search results, would you still use it? Slow experiences make us stop using products, and when every query returns painfully slowly, we simply don’t ask as many questions.

There are two main bottlenecks that hold performance back:

In many cases, higher performance means consuming more compute resources, which drives costs up. If your organization has such requirements, it’s important to examine the data warehouse architecture and ensure that performance is achieved efficiently.
Modern cloud data warehouses enable users to analyze large amounts of data at high granularity and with near real-time query response times, by combining a range of tailored techniques including compression, low-footprint indexing, sparse indexing, cost-based optimizers and JIT compilation.
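A practical way to evaluate this during a proof of concept is to run a representative query on your own data and check how much data the engine actually scans and how long it takes. The sketch below is only illustrative; the table and column names are hypothetical, and the way scanned data is reported (EXPLAIN output, query history or profiling views) differs per engine.

  -- Illustrative: a typical customer-facing analytics query over a large fact table
  SELECT   campaign_id,
           COUNT(*)     AS impressions,
           SUM(revenue) AS total_revenue
  FROM     ad_events
  WHERE    event_date BETWEEN '2021-01-01' AND '2021-01-07'
  GROUP BY campaign_id;

  -- Most engines let you inspect the plan and the amount of data scanned,
  -- e.g. with EXPLAIN; compare scanned bytes and runtime across candidate
  -- warehouses on the same data set and the same resource budget
  EXPLAIN
  SELECT campaign_id, COUNT(*)
  FROM   ad_events
  WHERE  event_date BETWEEN '2021-01-01' AND '2021-01-07'
  GROUP BY campaign_id;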

Steps for evaluation:

Ask yourself...

How fast can this data warehouse run, on how much data, and with how many resources (which drive costs)?

2. Elasticity

In traditional data warehouses, as usage or data scales up, users find that performance no longer meets their business needs. Decoupling storage and compute enables seamless scaling up or down to support any workload, amount of data and number of concurrent users. The flexibility to resize nodes without expensive, time-consuming re-sharding/re-clustering, vacuuming, defragmentation and other heavy-lifting tasks is crucial for handling ever-changing resource requirements on demand.
By isolating resources, different teams can start and stop compute engines for different workloads and tasks, such as ETL, heavy querying and exploration, while getting the performance they need. Choosing a cloud data warehouse that gives you full control to allocate the right resources to the right tasks will also help you minimize your bill and enjoy a truly efficient, seamless experience.
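As an illustration of what that control can look like, the sketch below uses Snowflake-style warehouse DDL (one of the vendors discussed later in this paper); the warehouse names and sizes are hypothetical, and other platforms expose similar capabilities through different commands.

  -- Illustrative (Snowflake-style syntax): a dedicated compute cluster for ETL
  -- that suspends itself when idle so you stop paying for it
  CREATE WAREHOUSE etl_wh
    WITH WAREHOUSE_SIZE = 'LARGE'
         AUTO_SUSPEND   = 300   -- seconds of inactivity before suspending
         AUTO_RESUME    = TRUE;

  -- a separate, smaller cluster for ad-hoc exploration, isolated from ETL
  CREATE WAREHOUSE adhoc_wh
    WITH WAREHOUSE_SIZE = 'SMALL'
         AUTO_SUSPEND   = 60
         AUTO_RESUME    = TRUE;

  -- resize on demand as a workload grows, with no re-sharding or re-clustering
  ALTER WAREHOUSE etl_wh SET WAREHOUSE_SIZE = 'XLARGE';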

Ask yourself...

How simple is it to dynamically allocate different resources for different tasks?

3. Ease of Use

Modern cloud data warehouses are becoming much easier to manage than what you’re used to, and this should be a key consideration in your evaluation process. Aim to replace the time and resources spent on unproductive maintenance tasks with valuable data analysis and development.

Make sure your cloud data warehouse simplifies the following:

Ask yourself...

How much time, and how many resources and people, will you need to spend on maintenance as opposed to insight discovery?

4. Cost Efficiency

Understanding cloud data warehouse pricing models is not straightforward, as they depend on different parameters such as speed, scale and usage. Common pricing models include:

Pay per TB scanned (Athena, BigQuery):

The ‘Pay per TB scanned’ model covers storage and executed-query costs, which means pricing depends heavily on usage and workload size. While this pricing model works well for small to mid-sized data sets, it becomes pricey in big-data use cases where a lot of data needs to be scanned. At an illustrative rate of $5 per TB scanned, for example, a dashboard that scans 1 TB per query and runs 200 queries a day adds up to roughly $1,000 a day in query charges alone.

Relevant use-cases:

Pay for consumed cloud resources (Redshift, Snowflake):

The cost of this model depends on how much you use the platform, your performance requirements and your data set sizes; users are typically charged per hour or per second. The drawback is that use cases requiring high performance and frequent querying will inflate your budget. Large organizations with many users, large data sets and high performance requirements need the warehouse to always be on, and paying a high per-use cost isn’t attractive when usage is 24/7.

Relevant use-cases:

Pay for cloud resources at the base cost + fixed subscription (Firebolt):

In this model the vendor doesn’t make a profit on the cloud resources customers consume. The incentive behind it is to give companies the freedom to maximize performance and usage, and to support future growth, without worrying about runaway costs. Users pay for cloud resources at base cost plus a fixed annual subscription with three tiers based on data set size. While this is the most cost-effective pricing model, it doesn’t include monthly or on-demand options, which makes it less suitable for small workloads and lower usage patterns.

Relevant use-cases:

Regardless of your use cases, why should costs stand in the way of a valuable insight? A good solution encourages you to run more queries, on more data, by more users, to get more value, and all of that without the cost trade-off.

Ask yourself...

What are your data set size and performance requirements today, and what will they be in 1-2 years? Will the pricing model support your future growth?

5. Support for Structured and Semi-Structured Data

Data no longer arrives only in predictable, structured formats; it arrives in different types and from different sources. Semi-structured data can enrich your analytics, but most traditional data warehouses aren’t equipped to handle it. Users waste time and money on inefficient flattening (unnesting/exploding), which multiplies the number of rows by the number of elements in each array. As a result you end up with a much bigger table, unnecessary costs and painfully slower performance.

A modern data warehouse will enable you to query semi-structured data with standard SQL, without complicated ETL processes that flatten the data and blow up data set sizes and costs. This can be achieved with native array manipulation functions, without compromising speed or efficiency.

The right way to handle semi-structured data:

  1. Load semi-structured data without transformation: JSON manipulation functions are used to seamlessly cleanse, fix and organize semi-structured data as it is ingested.
  2. Automatically convert semi-structured data to make it query-ready: semi-structured data is automatically converted and stored as arrays of primitive types. Users can then query it extremely fast with native array manipulation functions while using standard SQL (see the sketch after this list).
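To make this concrete, here is a minimal sketch of the pattern, assuming a hypothetical events table whose repeated JSON "tags" attribute is stored as a native array column. The table, column and function names are illustrative, and array syntax and function names vary between engines.

  -- Illustrative schema: the repeated "tags" attribute from the source JSON is
  -- kept as a native array instead of being flattened into one row per tag
  CREATE TABLE events (
    event_id  BIGINT,
    user_id   BIGINT,
    tags      ARRAY(TEXT)   -- array column type syntax differs per engine
  );

  -- Query the array in place with standard SQL plus an array function, instead
  -- of exploding the table (function names and argument order differ per engine)
  SELECT   user_id, COUNT(*) AS matching_events
  FROM     events
  WHERE    ARRAY_CONTAINS('checkout', tags)
  GROUP BY user_id;

Compare this with the flattened alternative: unnesting tags would multiply the row count by the average number of tags per event before the filter is even applied.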

Ask yourself...

Will you get the same experience working with semi-structured data as you do with structured data? What will the trade-offs of working with semi-structured data be?

6. Concurrency

Some vendors impose limitations on concurrency, allowing users to submit only one query at a time or capping the number of concurrent queries per account. Moreover, even without explicit limits, high concurrency can degrade performance, which brings us back to elasticity: it’s important to be able to add resources on demand to support business growth without compromising performance.

Consider a cloud data warehouse that does not limit concurrency or sacrifice performance under load. And remember: your business will grow, and your needs today are not your needs tomorrow.
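As one example of scaling for concurrency rather than capping it, some platforms can add compute clusters automatically as concurrent queries queue up. The sketch below uses Snowflake-style multi-cluster syntax (an Enterprise-tier feature there); the names and numbers are hypothetical, and other vendors handle this differently.

  -- Illustrative (Snowflake-style syntax): a warehouse that scales out to extra
  -- clusters under concurrent load and scales back in when the queue drains
  CREATE WAREHOUSE bi_wh
    WITH WAREHOUSE_SIZE    = 'MEDIUM'
         MIN_CLUSTER_COUNT = 1
         MAX_CLUSTER_COUNT = 4
         AUTO_SUSPEND      = 300;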

Ask yourself...

Does the evaluated cloud data warehouse cap the number of concurrent queries or users, and can it maintain performance as concurrency grows?

7. Data Granularity

A common way to bypass performance constraints is aggregation. But how many times have you prepared a report thinking you only needed it at the category level, only to discover a few months later that someone needs it at the product level?

The issue with choosing the right level of granularity is that detailed data can be too voluminous. That’s why granularity is another reason not to compromise on the first factor: choose a platform that supports high performance at scale, without a heavy cost trade-off.

For example, technologies like sparse indexes enable users to pull only the rows that are relevant to the query, at the most granular level. This is crucial for performance in data lake environments, where fetching unnecessary data from the underlying storage layer carries a huge performance penalty. Bottom line: think twice before you agree to trade granularity for performance; a good platform will provide both.
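As a rough sketch of that idea, the pseudo-DDL below keeps sales data at the product level and relies on a sparse index over the sort key for pruning; the table definition and index clause are illustrative, and each engine has its own syntax for sort keys and primary indexes.

  -- Illustrative: keep product-level rows rather than pre-aggregating to
  -- category level. The index clause is pseudo-DDL; exact syntax differs per engine.
  CREATE TABLE sales (
    sale_date    DATE,
    category_id  INT,
    product_id   INT,
    amount       DOUBLE
  ) PRIMARY INDEX (sale_date, category_id, product_id);

  -- With a sparse index over the sort key, a selective query reads only the data
  -- ranges that match the filter instead of scanning the entire table
  SELECT   product_id, SUM(amount) AS revenue
  FROM     sales
  WHERE    sale_date = '2021-06-01' AND category_id = 42
  GROUP BY product_id;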

Ask yourself...

Will you have to sacrifice granularity to achieve performance?

8. Deployment Options

Needless to say, you must be able to deploy the data warehouse you select on the cloud you’re already using. Some data warehouses are deployed exclusively on AWS, GCP or Azure, while others offer multi-cloud deployments.

Most leading solutions are available on AWS, due to its dominance in the public cloud market. AWS has a huge array of services, as well as the most comprehensive network of worldwide data centers. According to Gartner, "AWS is the most mature, enterprise-ready provider, with the deepest capabilities for governing a large number of users and resources."

However, the ideal approach is multi-cloud, which enables companies to avoid vendor lock-in and provides the flexibility to negotiate rates and capitalize on services offered by different cloud providers.

Ask yourself...

Is your current cloud platform lacking features that you can find in other cloud platforms?

9. Ecosystem Integrations

A healthy ecosystem of partners is important for smooth integration with the tools you already use. Typically, the data warehouses provided by cloud vendors have the most extensive integrations with that vendor’s other tools. Seamless integration with your BI tools, ingestion frameworks and data lake will substantially shorten your time to market.

Ask yourself...

Which tools are included in your stack and how easily do they integrate with the evaluated cloud data warehouse?

10. Data Freshness

Some use cases, such as fraud prevention, predictive maintenance and operational dashboards, inherently require real-time analytics. But as a rule of thumb, for any use case, the fresher the data, the more accurate the analytics. A batch report from yesterday is not as valuable as analytics that are continuously up to date, and this is key to improving any service.

Choose a cloud data warehouse that supports streaming alongside batch ingestion at high performance. How quickly you can make sense of more data determines how fast you can adjust to changing conditions and stay ahead of your competitors.
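As a hedged illustration of continuous ingestion, the sketch below uses Snowflake-style Snowpipe syntax to load new files as they land instead of waiting for a nightly batch; the pipe, table and stage names are hypothetical, and other warehouses offer their own streaming or continuous-load mechanisms.

  -- Illustrative (Snowflake-style syntax): continuously ingest new event files
  -- as they arrive in the stage, keeping the table fresh for querying
  CREATE PIPE events_pipe
    AUTO_INGEST = TRUE
    AS COPY INTO events
       FROM @events_stage
       FILE_FORMAT = (TYPE = 'JSON');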

Ask yourself...

Does the evaluated cloud data warehouse support continuous ingestion of data?

Vendor Comparison Template

Open the comparison template in Google Sheets