The rapid growth in the diversity and size of data can make querying and ETL/ELT processes tedious and frustrating. When evaluating the performance of cloud data warehouses, consider your use cases. If you're running customer-facing analytics, performance can make or break the product. But even in internal BI use cases, the ability to run more queries instead of waiting for data, and to have fresh data quickly accessible for querying, can be crucial for success. Think about it: if it took Google a few minutes (or hours) to return your search results, would you still use it? Slow experiences make us stop using products. When every query returns painfully slowly, we simply don't ask as many questions.
There are two main bottlenecks that hold performance back:
In many cases, achieving higher performance means consuming more compute resources, which results in heavy costs. So if your organization has such requirements, it's important to examine the data warehouse architecture and ensure that performance is achieved efficiently.
Modern cloud data warehouses enable users to analyze large amounts of data at high granularity, with near real-time query response times, by combining a range of tailored techniques including compression, low-footprint indexing, sparse indexing, cost-based optimizers and JIT compilation.
How does the technology scale in terms of query latency and concurrency?
What does the vendor's compute scaling model look like?
What is the performance profile of the vendor's storage architecture?
Does the technology support data compression, indexing and pruning?
Is the compute utilization in line with your current use case?
Will compute resources meet future scaling needs?
What add-ons are required to support sub-second response times?
If using serverless, what mix of query types will maximize usage of the serverless sizing unit (e.g. slots, virtual warehouse credits, DWUs, RPUs)?
Does the technology support join, indexing and automated materialized views?
In traditional data warehouses, as usage or data scales up, users notice that performance no longer meets their business needs. Decoupling storage and compute enables seamless scaling up or down to support any workload, data volume and number of concurrent users. The flexibility to resize nodes without expensive, time-consuming re-sharding/re-clustering, vacuuming, defragmentation and other heavy lifting is crucial for handling ever-changing resource requirements on demand.
By isolating resources, different teams can start and stop compute engines for different workloads and tasks, such as ETL, heavy querying and exploration, while getting the performance they need. Choosing a cloud data warehouse that provides full control to allocate the right resources to the right tasks will also help you minimize your bill and enjoy a truly efficient and seamless experience.
How does the technology provide the ability to start/stop compute resources?
Can compute and storage scale independently for CPU, memory, capacity and performance?
How does the scaling model provide for workload isolation?
Does the technology provide APIs to automate environment spin-up and spin-down?
Does the technology provide elastic scaling capabilities?
What controls are available to avoid cost overruns?
Modern cloud data warehouses are becoming much easier to manage than what you're used to, and this should be a key consideration in your evaluation process. Aim to replace time and resources spent on non-productive tasks with valuable data analysis and development.
Make sure your cloud data warehouse simplifies the following:
Does the technology require infrastructure management (servers, storage, etc.)?
What specific infrastructure management is required?
Does the platform require maintenance tasks such as vacuum, optimize or coalesce to maintain efficiency?
Is the offering built as IaaS or SaaS?
How much time, resources and manpower will you need to spend on maintenance?
Does the technology require proprietary, non-transferable skills?
What ecosystem integrations (ingestion tools, transformation, observability, visualization, etc.) does the platform support?
The process of understanding cloud data warehouse pricing models is not straightforward, as they are dependent on different parameters like speed, scale and usage. Common pricing models include:
The ‘Pay Per TB Scanned’ model includes storage and executed-query costs, meaning pricing depends heavily on usage and the size of your workload. While this model works well for small to mid-sized data sets, it becomes pricey in big data use cases, where a lot of data needs to be scanned.
The cost of this model depends on how much you use the platform, your performance requirements and your dataset sizes; users are typically charged per hour or per second.
In this model the vendor doesn't profit from the cloud resources customers consume. The incentive is to give companies the freedom to maximize performance and usage, and to support future growth, without worrying about runaway costs. Users pay the base cost for cloud resources plus a fixed annual subscription with three tiers based on dataset size. While this is the most cost-effective pricing model, it doesn't include monthly or on-demand options, which makes it less suitable for small workloads and lower usage patterns.
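To see how these models diverge, here is a minimal back-of-the-envelope sketch. All prices and workload numbers are illustrative assumptions, not any vendor's actual rates:

```python
# Hypothetical cost sketch comparing two common pricing models.
# All rates and workload figures below are assumed for illustration only.

def cost_per_tb_scanned(tb_scanned_per_month: float, price_per_tb: float) -> float:
    """'Pay per TB scanned': cost grows with the data each query touches."""
    return tb_scanned_per_month * price_per_tb

def cost_per_compute_hour(hours_per_month: float, price_per_hour: float) -> float:
    """'Pay per compute hour': cost grows with how long compute runs."""
    return hours_per_month * price_per_hour

# Assumed workload: 2,000 queries/month, each scanning 0.5 TB on average.
scanned = 2_000 * 0.5  # 1,000 TB scanned per month

print(cost_per_tb_scanned(scanned, price_per_tb=5.0))   # 5000.0
print(cost_per_compute_hour(300, price_per_hour=4.0))   # 1200.0
```

The takeaway is not the specific numbers but the shape of the curves: scan-based pricing scales with how much data your queries touch, while compute-based pricing scales with how long resources run, so the cheaper model depends entirely on your workload profile.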
What are your dataset size and performance requirements today, and what will they be in 1-2 years?
Will the pricing and scaling model support your future growth?
Does the vendor charge for raw dataset capacity or for consumed (compressed) capacity?
Is the provisioning model coarse or granular? (For example, if you need more memory, can you increase it without adding nodes or vCPUs?)
What types of storage does the vendor recommend (block, filesystem, object storage, etc.)?
How does the vendor optimize storage capacity and performance?
Data no longer arrives only in predictable and structured formats; it arrives in different types and from different sources. Semi-structured data can enrich your analytics, but most traditional data warehouses aren't equipped to handle it. Users waste time and money on inefficient flattening/unnesting/exploding, which multiplies the number of rows by the number of elements in the arrays. As a result you end up with a much bigger table, unnecessary costs and painfully slower performance.
A modern data warehouse will enable you to query semi-structured data with standard SQL and without complicated ETL processes which flatten and blow up data set sizes and costs. This can be achieved with native array manipulation functions, and without compromising speed and efficiency.
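The blow-up from flattening is easy to demonstrate with a toy example (hypothetical data; in a warehouse this would typically happen via a SQL unnest operation):

```python
# Toy illustration of why flattening semi-structured data blows up table size.
# Each source row carries an array; unnesting duplicates the row once per element.

rows = [
    {"user": "a", "events": ["view", "click", "purchase"]},
    {"user": "b", "events": ["view", "view"]},
]

# Flattening/unnesting: one output row per (row, array element) pair.
flattened = [
    {"user": r["user"], "event": e}
    for r in rows
    for e in r["events"]
]

print(len(rows))       # 2 source rows
print(len(flattened))  # 5 rows after unnesting: the sum of the array lengths
```

With millions of rows and large arrays this multiplication dominates storage and scan costs, which is why native array functions that query the nested data in place are preferable.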
The right way to handle semi-structured data:
What data sources are primarily driving your use cases?
Does the technology support semi-structured data?
What are some recommended strategies for managing semi-structured data?
Does the technology provide the ability to store raw semi-structured data as well as flattened data structures?
The standard business definition of concurrency is “the number of users using the system at the same time”. For a database it's “the number of queries executing at the same time”.
On the surface those definitions may seem very close, but in reality they can be miles apart. A very targeted data app, for example, can have thousands of simultaneous users and yet almost no database concurrency if queries execute in milliseconds and collisions are rare. On the other hand, a single person browsing dashboards in a business intelligence tool may have 15 simultaneous queries running much of the time, as the tool executes 15 widget queries at once before queuing.
This makes it tricky to get a real answer when asking a software vendor how their platform handles concurrency. When evaluating databases, determine the actual number of simultaneous queries you expect, if possible, rather than the number of users.
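One way to estimate true database concurrency is Little's Law: the average number of queries in flight equals the query arrival rate times the average query duration. A sketch, with all workload numbers assumed for illustration:

```python
# Back-of-the-envelope database concurrency estimate via Little's Law:
# average queries in flight = arrival rate (queries/sec) * average duration (sec).
# All workload figures below are illustrative assumptions.

def avg_concurrent_queries(queries_per_second: float, avg_duration_s: float) -> float:
    return queries_per_second * avg_duration_s

# A targeted data app: 5,000 users, each issuing ~1 query/minute, 50 ms per query.
app_qps = 5_000 / 60
print(round(avg_concurrent_queries(app_qps, 0.05), 1))  # 4.2 queries in flight

# One BI dashboard firing 15 widget queries of ~4 s each, refreshed every minute:
# on average only 1 query in flight, but with a burst of 15 at refresh time.
print(avg_concurrent_queries(15 / 60, 4.0))  # 1.0
```

This is why thousands of app users can generate less database concurrency than a handful of dashboard users: both the average and the burst pattern matter when sizing for concurrency.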
Some vendors impose limitations on concurrency, allowing users to submit only one query at a time or capping the number of concurrent queries per account. Moreover, even without such limits, concurrency can degrade performance, which brings us back to elasticity. It's important to be able to add resources on demand to support business growth without compromising performance.
Consider a cloud data warehouse which does not limit concurrency or sacrifice performance. And remember, your business will grow. Your needs today are not your needs tomorrow.
How many users will be querying the data simultaneously?
Will one user need to run multiple queries at the same time?
What are your future needs?
What effort is entailed in increasing the number of concurrent queries as the business grows?
What are the cost implications of scaling concurrency?
Is workload management required to manage concurrency?
A common way to bypass performance constraints is aggregation. But how many times have you prepared a report thinking you only needed it at the category level, only to discover a few months later that someone needs it at the product level?
The issue with choosing the right levels of granularity is that detailed data can be too voluminous. That’s why granularity is another reason not to compromise on the first factor - choose a platform that supports high performance at scale, without a heavy cost tradeoff.
For example, technologies like sparse indexes enable users to pull only the rows relevant to the query, at the most granular level. This is crucial for performance in data lake environments, where fetching unnecessary data from the underlying storage layer carries a huge performance penalty. Bottom line: think twice before compromising granularity for performance; a good platform will provide both.
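The idea behind sparse indexing can be sketched in a few lines (a simplified model; real implementations operate over sorted, compressed blocks with much richer metadata):

```python
# Simplified sparse-index sketch: the index stores only min/max per data block,
# so a range query can skip whole blocks instead of scanning every row.

blocks = [
    {"min": 0,   "max": 99,  "rows": list(range(0, 100))},
    {"min": 100, "max": 199, "rows": list(range(100, 200))},
    {"min": 200, "max": 299, "rows": list(range(200, 300))},
]

def query_range(blocks, lo, hi):
    """Return matching rows, scanning only blocks that can contain them."""
    scanned = 0
    out = []
    for b in blocks:
        if b["max"] < lo or b["min"] > hi:
            continue  # pruned: this block cannot contain any matches
        scanned += 1
        out.extend(r for r in b["rows"] if lo <= r <= hi)
    return out, scanned

matches, scanned = query_range(blocks, 150, 160)
print(len(matches), scanned)  # 11 matching rows, and only 1 of 3 blocks scanned
```

Because the index holds only per-block summaries rather than per-row entries, it stays tiny while still letting the engine avoid most of the I/O, which is exactly the property that preserves granularity without the performance penalty.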
What level of granularity do your workloads need (e.g. partitions, micro-partitions, files, block level)?
What is the granularity at which the platform operates?
Does the platform require add-ons to address granularity?
Needless to say, you must be able to deploy the data warehouse you select on the cloud you're using. Some data warehouses are deployed exclusively on AWS, GCP or Azure, while others offer multi-cloud deployments.
Most leading solutions are available on AWS, due to its dominance in the public cloud market. AWS has a huge array of services, as well as the most comprehensive network of worldwide data centers. According to Gartner, "AWS is the most mature, enterprise-ready provider, with the deepest capabilities for governing a large number of users and resources."
However, the ideal approach is multi-cloud, which enables companies to avoid vendor lock-in and provides the flexibility to negotiate rates and capitalize on services offered by different cloud providers.
Is your current cloud platform lacking features that you can find in other cloud platforms?
What do the security and operational models look like as you go across multiple clouds?
What ecosystem integrations does the cloud provider offer?
Healthy ecosystem partners are important for a smooth integration with the tools you already use. Typically data warehouses provided by cloud vendors will have the most extensive integrations with the other tools the vendor offers. Seamless integration with your BI tools, ingestion frameworks and data lake will substantially shorten your time to market.
Which tools are included in your stack?
How easily do they integrate with the cloud data warehouse?
Does the platform provide API and SQL access?
Does the platform support JDBC and ODBC drivers?
Some use cases, like fraud prevention, predictive maintenance and operational dashboards, essentially require real-time analytics. But as a rule of thumb, for any use case, the fresher the data, the more accurate the analytics. A batch report from yesterday is not as valuable as analytics that are continuously up to date. This is key to improving any service.
Choose a cloud data warehouse that supports streaming alongside batch at high performance. How quickly you can make sense of more data will determine how fast you adjust to changing conditions and stay ahead of your competitors.
Does the evaluated cloud data warehouse support continuous ingestion of data?
What integration patterns do you need with the chosen platform (batch, micro-batch, streaming, etc.)?
How does the platform address data latency and query latency?
Evaluating a data warehouse platform needs to be done holistically: understand not only feature-functionality differences but also the downstream impact of must-have capabilities and nice-to-have features. Downstream impact can be long-lasting in terms of customer satisfaction, cost and operations. Additionally, every data warehouse vendor offers extra bells and whistles that can come at a premium, such as features built around machine learning, data integration, data sharing and data catalogs. Each of these can be addressed with a best-of-breed approach, without lock-in to a single vendor. Besides, a Swiss army knife is hardly the best tool for every home repair job. Start the evaluation with the core platform and its downstream impacts in mind.
Open the comparison template in Google Sheets.