Cloud Data Warehouse Evaluation Criteria

1. Performance at Scale

The rapid growth in the diversity and size of data can make querying and ETL/ELT processes tedious and frustrating. When evaluating the performance of cloud data warehouses, consider your use cases. If you’re running customer-facing analytics, performance can make or break the product. But even in internal BI use cases, the ability to run more queries instead of waiting for data, and to have fresh data quickly accessible for querying, can be crucial for success. Think about it: if it took Google a few minutes (or hours) to return your search results, would you still use it? Slow user experiences make us stop using products. When every query returns painfully slow results, we simply don’t ask as many questions.

There are two main bottlenecks that hold performance back:

Storage bottleneck: data lakes are built for virtually unlimited storage, but are very slow when large amounts of data need to be scanned and moved to the compute layer for querying.
Compute bottleneck: decades-old techniques for processing data are not efficient enough for today’s data sets. Queries that cannot be accelerated through scale-out degrade the end-user experience.

In many cases, reaching higher performance means consuming more compute resources, which results in heavy costs. If your organization has such requirements, it’s important to examine the data warehouse architecture and ensure that performance is achieved efficiently.

Modern cloud data warehouses enable users to analyze large amounts of data at high granularity and with near real-time query response times by combining a range of tailored techniques, including compression, low-footprint indexing, sparse indexing, cost-based optimizers and JIT compilation.

Evaluation criteria:

How does the technology scale in terms of query latency and concurrency?
What does the vendor’s compute scaling model look like?
What is the performance profile of the vendor’s storage architecture?
Does the technology support data compression, indexing and pruning?
Is the compute utilization in line with your current use case?
Will compute resources meet future scaling needs?
What add-ons are required to provide support for sub-second response times?
If using serverless, what mix of query types will maximize usage of the serverless sizing unit (e.g. slots, virtual warehouse credits, DWU, RPU)?
Does the technology support join indexing and automated materialized views?

2. Elasticity

In traditional data warehouses, as usage or data volume scales up, users find that performance no longer meets their business needs. Decoupling storage and compute enables seamless scaling up or down to support any workload, amount of data and number of concurrent users. The flexibility to resize nodes without expensive and time-consuming re-sharding or re-clustering, vacuuming, defragmentation and other heavy lifting is crucial for handling ever-changing resource requirements on demand.

By isolating resources, different teams can start and stop compute engines for different workloads and tasks, such as ETL, heavy querying and exploration, while getting the performance they need. Choosing a cloud data warehouse that gives you full control to allocate the right resources to the right tasks will also help you minimize your bill and enjoy a truly efficient and seamless experience.

Evaluation Criteria:

How does the technology provide the ability to start and stop compute resources?
Can compute and storage scale independently for CPU, memory, capacity and performance?
How does the scaling model provide for workload isolation?
Does the technology provide APIs to automate environment spin-up and spin-down?
Does the technology provide elastic scaling capabilities?
What controls are available to avoid cost overruns?

3. Ease of Use

Modern cloud data warehouses are becoming much easier to manage than what you’re used to, and this should be a key consideration in your evaluation process. Aim to replace time and resources spent on unproductive tasks with valuable data analysis and development.

Make sure your cloud data warehouse simplifies the following:

Servers, clusters, installations and hardware: consider a SaaS platform that takes care of infrastructure while leaving you enough control over your user experience and costs.
Performance: make sure performance at scale is achieved as efficiently as possible, without pre-aggregations, complex scale-up procedures or endless optimization projects.
Semi-structured data: you want to be ready to support any data type. Ensure your data warehouse has native support for storing semi-structured data and querying it with SQL. Semi-structured data should be analyzed quickly and easily, without complicated ETL processes that flatten the data and blow up data set size and costs.
Updates and deletes: simplify the process of supporting merge data updates without rewriting tables.
Data structure problems: spare yourself the tedious tasks of defragmentation and vacuuming. These are no longer required in modern solutions.
Assigning resources to users: with complete compute and storage resource isolation, you can granularly control who gets what kind of resource. This helps you both save costs and have more control over the experience your users get when querying data.
Programming languages: don’t create skill gaps in your organization. Program with SQL rather than proprietary programming languages, so that all your users can feel comfortable writing scripts that automate complex procedures in minutes.
Importing data: make sure ingestion is quick and easy, with the platform analyzing and understanding the schema of the data in your data lake. This removes lengthy guesswork iterations and lets you get started with new data as fast as possible.

Evaluation criteria:

Does the technology require infrastructure management (servers, storage etc.)?
What specific infrastructure management is required?
Does the platform require management tasks such as vacuum, optimize, coalesce etc. to maintain efficiency?
Is the offering built as IaaS or SaaS?
How much time, resources and manpower will you need to spend on maintenance?
Does the technology require proprietary, non-transferable skills?
What ecosystem integrations (ingestion tools, transformation, observability, visualization etc.) does the platform support?

4. Cost Efficiency

The process of understanding cloud data warehouse costs is not straightforward, as they are dependent not only on pricing models, but also on speed, scale and usage. To evaluate how much a platform will cost, you need to understand:

Bigger data scans equal higher costs. Make sure you’re able to leverage indexes that are built into the platform to query less data more efficiently.
Don’t spend money on unutilized resources. Make sure that the platform provides granular control over resource consumption, and that you can scale in small linear increments rather than by doubling.
Slow and inefficient platforms make manpower costs explode, as engineers, analysts and architects spend hours on optimizations. Once you start leveraging indexes and, in some cases, array functions, you start to see efficiencies in the process, not just the product.
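The difference between scaling in small linear increments and scaling by doubling can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names and the node counts are assumptions chosen for the example, not any vendor's sizing scheme.

```python
import math


def nodes_linear(required: int, step: int = 1) -> int:
    """Smallest node count >= required when nodes are added `step` at a time."""
    return math.ceil(required / step) * step


def nodes_doubling(required: int) -> int:
    """Smallest node count >= required when sizes only double (1, 2, 4, 8, ...)."""
    size = 1
    while size < required:
        size *= 2
    return size


def idle_fraction(provisioned: int, required: int) -> float:
    """Share of the provisioned nodes that the workload does not need."""
    return (provisioned - required) / provisioned


# Hypothetical workload that needs 9 nodes' worth of compute.
required = 9
for label, provisioned in (
    ("linear", nodes_linear(required)),
    ("doubling", nodes_doubling(required)),
):
    idle = idle_fraction(provisioned, required)
    print(f"{label}: {provisioned} nodes provisioned, {idle:.0%} idle")
```

In this made-up case the doubling model provisions 16 nodes for a 9-node workload, so a large share of what you pay for sits unused.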

Common pricing models include:

Pay per TB scanned (Athena, BigQuery):

The ‘Pay Per TB Scanned’ model includes storage and executed query costs. This means that pricing depends heavily on usage and the size of your workload. While this pricing model works well for small to mid-sized data sets, it becomes pricey in big data use cases, where a lot of data needs to be scanned.

Pay for consumed cloud resources (Redshift, Snowflake, Firebolt):

The cost of this model depends on how much you use the platform, performance requirements and dataset sizes, and users are typically charged per hour or second.
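To compare the two pricing models for your own workload, it can help to estimate the monthly bill under each and find the break-even point. The following is a rough sketch, not a pricing calculator: every price and volume in it is a hypothetical placeholder, storage charges are ignored, and it assumes the compute-based bill is driven only by how many hours resources are running.

```python
def scan_based_cost(queries: int, tb_scanned_per_query: float, price_per_tb: float) -> float:
    """Monthly query cost when billed per TB scanned (storage charges ignored)."""
    return queries * tb_scanned_per_query * price_per_tb


def compute_based_cost(hours_running: float, price_per_hour: float) -> float:
    """Monthly cost when billed for the time compute resources are running."""
    return hours_running * price_per_hour


def break_even_queries(tb_scanned_per_query: float, price_per_tb: float,
                       hours_running: float, price_per_hour: float) -> float:
    """Monthly query count at which both models cost the same."""
    per_query = tb_scanned_per_query * price_per_tb
    return compute_based_cost(hours_running, price_per_hour) / per_query


# Hypothetical placeholders, NOT vendor list prices.
PRICE_PER_TB = 5.0      # currency units per TB scanned
PRICE_PER_HOUR = 4.0    # currency units per compute hour
HOURS_RUNNING = 200.0   # compute hours per month
TB_PER_QUERY = 0.1      # average TB scanned by one query

compute = compute_based_cost(HOURS_RUNNING, PRICE_PER_HOUR)
for queries in (500, 1_600, 10_000):
    scan = scan_based_cost(queries, TB_PER_QUERY, PRICE_PER_TB)
    print(f"{queries:>6} queries: per-TB {scan:,.2f} vs compute {compute:,.2f}")
```

Plugging in your own query volumes and the vendor's published prices shows on which side of the break-even point your workload falls, both today and at the growth you expect.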

Evaluation Criteria:

What are your data set size and performance requirements today, and what will they be in 1-2 years?
Will the pricing and scaling model support your future growth?
Does the vendor charge for data set capacity or consumed-compressed capacity?
Is the provisioning model coarse or granular? (For example: if you need to increase memory, can you do so without increasing the number of nodes or vCPUs?)
What types of storage does the vendor recommend (block, filesystem, object storage etc.)?
How does the vendor optimize storage capacity and performance?

5. Supports Structured and Semi-Structured Data

Data no longer arrives only in predictable, structured formats. It arrives in different types and from different sources. Semi-structured data can enrich your analytics, but most traditional data warehouses aren’t equipped to handle it. Users waste time and money on inefficient flattening, unnesting or exploding, which multiplies the number of rows by the number of elements in the arrays. As a result, you end up with a much bigger table, unnecessary costs and painfully slow performance.

A modern data warehouse will enable you to query semi-structured data with standard SQL, without complicated ETL processes that flatten the data and blow up data set sizes and costs. This can be achieved with native array manipulation functions, without compromising speed and efficiency.
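The row blow-up caused by flattening is easy to see on a toy data set. The sketch below uses plain Python lists as a stand-in for a table with an array column. The records and function names are invented for illustration, and the in-place check only mimics what a native array function would do inside a warehouse.

```python
# Three hypothetical order records, each carrying an array of tags.
rows = [
    {"order_id": 1, "tags": ["a", "b", "c"]},
    {"order_id": 2, "tags": ["a"]},
    {"order_id": 3, "tags": ["b", "c", "d", "e"]},
]


def flatten(rows):
    """Explode the array column: one output row per array element."""
    return [
        {"order_id": row["order_id"], "tag": tag}
        for row in rows
        for tag in row["tags"]
    ]


def orders_with_tag_flat(flat_rows, tag):
    """Answer the question by scanning the flattened (larger) table."""
    return sorted({row["order_id"] for row in flat_rows if row["tag"] == tag})


def orders_with_tag_native(rows, tag):
    """Answer the same question by checking each array in place."""
    return [row["order_id"] for row in rows if tag in row["tags"]]


flat = flatten(rows)
print(f"original rows: {len(rows)}, flattened rows: {len(flat)}")
print(orders_with_tag_flat(flat, "b"), orders_with_tag_native(rows, "b"))
```

Both approaches return the same orders, but the flattened table holds one row per array element, so it grows with array length while the original table does not.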

The right way to handle semi-structured data:

Load semi-structured data without transformation: JSON manipulation functions are used to seamlessly cleanse, fix and organize semi-structured data as it’s ingested.
Automatically convert semi-structured data to make it ready for querying: semi-structured data is automatically converted and stored as arrays of primitive types. Users can then query the data very quickly with native array manipulation functions while using standard SQL.

Evaluation Criteria:

What data sources are primarily driving your use cases?
Does the technology support semi-structured data?
What are some recommended strategies for managing semi-structured data?
Does the technology provide the ability to store raw semi-structured data as well as flattened data structures?

6. Concurrency

The standard business definition of concurrency is “the number of users using the system at the same time”. For a database, it’s “the number of queries executing at the same time”.

On the surface those definitions may seem very close, but in reality they can be miles apart. A very targeted data app, for example, can have thousands of simultaneous users and yet almost no database concurrency if queries execute in milliseconds and there are very few collisions. On the other hand, a single person browsing dashboards in a business intelligence tool can have 15 simultaneous queries running much of the time, as the tool will execute 15 widget queries at once before queuing.

This makes it a bit tricky to get a real answer when asking a software vendor how their platform handles concurrency. When evaluating databases, be sure to determine, if possible, the actual number of simultaneous queries you expect rather than the number of users.
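One common way to turn user counts into an expected number of simultaneous queries is Little's law: the average number of queries in flight equals the query arrival rate multiplied by the average query duration. The sketch below applies it to the two scenarios described above. The user counts, query rates and durations are made-up inputs, and the result is an average, so peaks will be higher.

```python
def avg_concurrent_queries(users: int,
                           queries_per_user_per_second: float,
                           avg_query_seconds: float) -> float:
    """Little's law: average queries in flight = arrival rate x average duration."""
    arrival_rate = users * queries_per_user_per_second  # queries per second
    return arrival_rate * avg_query_seconds


# Hypothetical data app: 1,000 users, one query every 30 s each, 10 ms per query.
data_app = avg_concurrent_queries(1_000, 1 / 30, 0.010)

# Hypothetical BI user: one person whose dashboard fires 15 widget queries
# every 20 s, each taking about 20 s to finish.
bi_dashboard = avg_concurrent_queries(1, 15 / 20, 20.0)

print(f"data app:     {data_app:.2f} queries in flight on average")
print(f"BI dashboard: {bi_dashboard:.2f} queries in flight on average")
```

With these inputs the thousand-user app averages well under one query in flight, while the single dashboard user keeps about fifteen running, which is why the query count, not the user count, is the number to bring to a vendor.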

Some vendors impose limitations on concurrency, allowing users to submit only one query at a time and setting a cap on the number of concurrent queries per account. Moreover, even without limitations, concurrency can lead to performance degradation, which brings us back to elasticity. It’s important to be able to add resources easily on demand to support business growth without compromising performance.

Consider a cloud data warehouse which does not limit concurrency or sacrifice performance. And remember, your business will grow. Your needs today are not your needs tomorrow.

Evaluation Criteria:

How many users will be querying the data simultaneously?
Will one user need to run multiple queries at the same time?
What are your future needs?
What is the effort entailed in increasing the number of concurrent queries as the business grows?
What are the cost implications of scaling concurrency?
Is workload management required to manage concurrency?

7. Data Granularity

A common way to bypass performance constraints is to aggregate data. But how many times have you prepared a report thinking you only needed it at the category level, just to discover a few months later that someone needs it at the product level?

The issue with choosing the right level of granularity is that detailed data can be too voluminous. That’s why granularity is another reason not to compromise on the first factor: choose a platform that supports high performance at scale, without a heavy cost tradeoff.

For example, technologies like sparse indexes enable users to pull only the rows that are relevant to the query, at the most granular level. This is crucial for performance in data lake environments, as fetching unnecessary data from the slow storage layer carries a huge performance penalty. Bottom line: think twice before you agree to compromise granularity for performance. A good platform will provide both.
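To illustrate why a sparse index avoids fetching unnecessary data, here is a minimal sketch of the general idea. It is not any particular vendor's implementation. It assumes the data is sorted by key and stored in fixed-size blocks, and that the index keeps only the minimum and maximum key of each block. All names are invented for the example.

```python
def build_sparse_index(sorted_keys, block_size):
    """Split sorted keys into blocks and record only each block's (min, max) key."""
    blocks = [
        sorted_keys[i:i + block_size]
        for i in range(0, len(sorted_keys), block_size)
    ]
    index = [(block[0], block[-1]) for block in blocks]
    return blocks, index


def blocks_to_read(index, lo, hi):
    """Positions of the blocks whose key range overlaps the query range [lo, hi]."""
    return [
        pos for pos, (block_min, block_max) in enumerate(index)
        if block_max >= lo and block_min <= hi
    ]


def range_query(blocks, index, lo, hi):
    """Read only the overlapping blocks, then filter rows inside them."""
    hits = []
    for pos in blocks_to_read(index, lo, hi):
        hits.extend(key for key in blocks[pos] if lo <= key <= hi)
    return hits


# 1,000 sorted keys stored in 10 blocks of 100 keys each.
blocks, index = build_sparse_index(list(range(1000)), block_size=100)
touched = blocks_to_read(index, 250, 260)
print(f"blocks read: {len(touched)} of {len(blocks)}")
print(range_query(blocks, index, 250, 260))
```

The index holds one small entry per block instead of one per row, yet it is enough to skip every block that cannot contain matching rows, so detailed data stays queryable without scanning all of it.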

Evaluation Criteria:

What level of granularity do your workloads need (e.g. partitions, micro-partitions, files, block level)?
What is the granularity at which the platform operates?
Does the platform require add-ons to address granularity?

8. Deployment Options

Needless to say, you must be able to deploy the data warehouse you select on the cloud you’re using. Some data warehouses are deployed exclusively on AWS, GCP or Azure, while others offer multi-cloud deployments.

Most leading solutions are available on AWS, due to its dominance in the public cloud market. AWS has a huge array of services, as well as the most comprehensive network of worldwide data centers. According to Gartner, "AWS is the most mature, enterprise-ready provider, with the deepest capabilities for governing a large number of users and resources."

However, the ideal approach is multi-cloud, which enables companies to avoid vendor lock-in and provides the flexibility to negotiate rates and capitalize on services offered by different cloud providers.

Evaluation Criteria:

Is your current cloud platform lacking features that you can find in other cloud platforms?
What do the security and operational models look like as you go across multiple clouds?
What ecosystem integrations does the cloud provider offer?

9. Ecosystem Integrations

Healthy ecosystem partners are important for a smooth integration with the tools you already use. Typically data warehouses provided by cloud vendors will have the most extensive integrations with the other tools the vendor offers. Seamless integration with your BI tools, ingestion frameworks and data lake will substantially shorten your time to market.

Evaluation Criteria:

Which tools are included in your stack?
How easily do they integrate with the cloud data warehouse?
Does the platform provide API and SQL access?
Does the platform support JDBC and ODBC drivers?

10. Data Freshness

Real-time analytics are critical in some scenarios, like fraud prevention, predictive maintenance and operational dashboards, but they’re not always a must. Don’t make a technology-first decision. First ask yourself whether your organization is mature enough to handle real-time:

Do you have the resources to monitor real-time data continuously?
Do you know what you would do if you saw a sudden spike or anomaly?
Is your organization aligned? Can it react, decide and take action in real-time?
What’s the level of automation within your company?

If you do decide that data freshness is a priority, consider the evaluation criteria below.

Evaluation Criteria:

Does the evaluated cloud data warehouse support continuous ingestion of data?
What integration patterns do you need with the chosen platform (batch, micro-batch, streaming etc.)?
How does the platform address data latency and query latency?

Steps for evaluation:

Understand the vendor’s pricing model: what each vendor charges for can speak volumes about the offering and its scalability.
Understand the infrastructure in terms of compute and storage capabilities.
Try to use the same type of compute resources and note the associated costs.
Measure response time and batch execution time, as well as the compute resources required per platform.
Score the vendors based on the price-performance ratio you’re seeing for your most common tasks.
Document your list of need-to-haves vs. nice-to-haves.
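The scoring step can be sketched as a simple calculation. The example below ranks vendors by cost per query, one possible price-performance measure, computed from the measured average response time and the hourly cost of the compute used during the test. The vendor names and all figures are placeholders, and the formula assumes queries run one after another on dedicated resources.

```python
from typing import NamedTuple


class BenchmarkResult(NamedTuple):
    vendor: str                   # placeholder name, not a real product
    avg_query_seconds: float      # measured average response time
    compute_cost_per_hour: float  # cost of the compute used in the test


def cost_per_query(result: BenchmarkResult) -> float:
    """Compute cost attributable to one query; lower is better."""
    return result.compute_cost_per_hour * result.avg_query_seconds / 3600


def rank_by_price_performance(results):
    """Order vendors from best (cheapest per query) to worst."""
    return sorted(results, key=cost_per_query)


# Made-up measurements for three hypothetical vendors.
results = [
    BenchmarkResult("Vendor A", avg_query_seconds=2.0, compute_cost_per_hour=8.0),
    BenchmarkResult("Vendor B", avg_query_seconds=6.0, compute_cost_per_hour=4.0),
    BenchmarkResult("Vendor C", avg_query_seconds=1.0, compute_cost_per_hour=20.0),
]

for result in rank_by_price_performance(results):
    print(f"{result.vendor}: {cost_per_query(result):.4f} per query")
```

Note that neither the cheapest hourly rate nor the fastest response time wins on its own in this made-up data, which is the reason to score on the ratio rather than on either number alone.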

Food for Thought:

Evaluating a data warehouse platform needs to be done holistically, by understanding not only differences in features and functionality but also the downstream impact of must-have capabilities and nice-to-have features. Downstream impact can be long-lasting in terms of customer satisfaction, cost and operations. Additionally, every data warehouse vendor offers extra bells and whistles that could come at a premium. These could be features built around machine learning, data integration, data sharing, data catalogs and so on. Each of these could be addressed with a best-of-breed approach, without lock-in to a single vendor. Besides, a Swiss Army knife is hardly the best tool for all home repair jobs. Start the evaluation with the core platform, keeping the downstream impacts in mind.

WHITEPAPER

Evaluating Cloud Data Warehouses: 10 Criteria to Consider

When selecting a cloud data warehouse, technical and cost constraints make users compromise on certain features. The following checklist of criteria was written to help you determine which factors are most important for the success of your organization.
