When selecting a cloud data warehouse, technical and cost constraints often force compromises on certain features. The following checklist of criteria will help you determine which factors are most important to your organization's success.
While evaluating the 10 parameters in the template, consider the following:
1. Performance
In many cases, higher performance means consuming more compute resources, which drives up costs. If your organization has demanding performance requirements, it's important to examine the data warehouse architecture and ensure that performance is achieved efficiently.
Modern cloud data warehouses enable users to analyze large amounts of data at high granularity, with near real-time query response times, by combining a range of tailored techniques including compression, low-footprint indexing, sparse indexing, cost-based optimizers and JIT compilation.
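As a rough sketch of one of these techniques, sparse indexing (often called data skipping) keeps only per-block metadata such as min/max values, so a query can skip whole blocks instead of scanning every row. The Python below is illustrative only, with made-up names, not any vendor's actual implementation:

```python
# Minimal sketch of sparse indexing / data skipping (illustrative only).

BLOCK_SIZE = 4

def build_sparse_index(values):
    """Store only (block_start, min, max) per block, not every row."""
    index = []
    for start in range(0, len(values), BLOCK_SIZE):
        block = values[start:start + BLOCK_SIZE]
        index.append((start, min(block), max(block)))
    return index

def query(values, index, target):
    """Scan only blocks whose [min, max] range can contain the target."""
    hits = []
    for start, lo, hi in index:
        if lo <= target <= hi:  # otherwise skip the whole block
            for i, v in enumerate(values[start:start + BLOCK_SIZE], start):
                if v == target:
                    hits.append(i)
    return hits

data = [1, 3, 2, 4, 10, 12, 11, 13, 20, 22, 21, 23]
idx = build_sparse_index(data)
print(query(data, idx, 12))  # only the middle block is scanned
```

The index stays tiny (three tuples for twelve values here), which is the "low-footprint" property the techniques above aim for.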
How fast can this data warehouse run, on how much data and with how many resources (affecting costs)?
2. Elasticity
Decoupling storage and compute enables seamless scaling up or down to support any workload, data volume and number of concurrent users. By isolating resources, different teams can start/stop compute engines for different workloads and tasks like ETL, heavy querying and exploration while getting the performance they need. Choosing a cloud data warehouse that provides full control to allocate the right resources for the right tasks will also help you minimize your bill and enjoy a truly efficient and seamless experience.
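The idea can be sketched as follows: each workload gets its own independently sized compute engine, all reading the same shared storage, and idle engines are stopped so they stop costing money. The class, field names and storage path below are hypothetical, purely to illustrate the pattern:

```python
# Illustrative sketch of decoupled storage/compute (hypothetical names).

from dataclasses import dataclass

@dataclass
class ComputeEngine:
    name: str
    nodes: int
    running: bool = False

    def start(self):
        self.running = True

    def stop(self):  # stop paying when the workload is idle
        self.running = False

# One shared dataset, separate engines per workload.
shared_storage = "s3://analytics/events/"          # hypothetical path
etl = ComputeEngine("etl", nodes=8)                # heavy nightly batch
bi = ComputeEngine("bi-dashboards", nodes=2)       # steady light load

etl.start()   # scale up only for the ETL window...
etl.stop()    # ...then release the resources
bi.start()    # dashboards keep running, unaffected by the ETL engine
```

The design point: because the engines share storage rather than data copies, stopping the ETL engine never interrupts the BI workload.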
How simple is it to dynamically allocate different resources for different tasks?
3. Ease of Use
In the long run, it's important to replace time spent on non-productive infrastructure tasks with valuable data analysis and development. These non-productive tasks include:
- Cluster management
- Complicated scaling up and down processes
- Data updates / deletes with lengthy ETL processes
- Pre-aggregation ETLs for better performance
- Flattening / normalizing semi-structured data with dedicated ETL processes
- Resharding across nodes
- Data vacuuming / compaction / defrag
- Endless index and performance optimizations
How much time, resources and manpower will you need to spend on maintenance as opposed to insight discovery?
4. Cost Efficiency
The major bottleneck in achieving better performance is usually cost. Every incremental query costs money, and big, complex queries cost a lot of money. That's why you need to ensure your data warehouse works efficiently, i.e., that it can elastically scale and increase performance with a minimum of resources.
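A back-of-the-envelope model makes the point concrete: if one engine needs a quarter of the nodes to hit the same performance target, the bill shrinks proportionally. The rates and node counts below are made-up examples, not real pricing:

```python
# Illustrative cost arithmetic -- made-up rates, not real vendor pricing.

HOURLY_RATE_PER_NODE = 2.0  # example on-demand price, USD

def monthly_cost(nodes, hours_per_day):
    """Nodes x hours x 30 days x hourly rate."""
    return nodes * hours_per_day * 30 * HOURLY_RATE_PER_NODE

# Same workload and performance target, two engines of different efficiency:
inefficient = monthly_cost(nodes=16, hours_per_day=12)  # 11520.0 USD/month
efficient = monthly_cost(nodes=4, hours_per_day=12)     # 2880.0 USD/month

print(inefficient, efficient)
```

Run this kind of projection with your own data set size and growth rate when comparing vendors.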
What is your data set size and performance requirement today and what will it be in 1-2 years? Is the data warehouse efficient enough to support your future growth?
5. Supports Structured and Semi-Structured Data
A modern data warehouse will enable you to query semi-structured data with standard SQL and without complicated ETL processes which flatten and blow up data set sizes and costs. This can be achieved with native array manipulation functions, and without compromising speed and efficiency.
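To see why flattening blows up data set sizes: a record with N array elements becomes N rows, duplicating every top-level field, whereas querying the array in place avoids the copy entirely. A minimal Python sketch with a made-up order record:

```python
# Sketch of flattening semi-structured data vs. querying it in place.
# The record is a made-up example.

import json

order = json.loads("""
{"order_id": 1, "customer": "acme",
 "items": [{"sku": "a", "qty": 2}, {"sku": "b", "qty": 1}, {"sku": "c", "qty": 5}]}
""")

# ETL-style flattening: one row per array element, parent fields repeated.
flat_rows = [
    {"order_id": order["order_id"], "customer": order["customer"], **item}
    for item in order["items"]
]
print(len(flat_rows))  # 3 rows from 1 record, top-level fields copied 3x

# Native array manipulation: answer the question without flattening.
total_qty = sum(item["qty"] for item in order["items"])
print(total_qty)  # 8
```

With wide parent records and long arrays, that duplication multiplies storage and scan costs, which is exactly the tradeoff native array functions avoid.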
Will you get the same experience as working with structured data when working with semi-structured data? What will be the tradeoffs of working with semi-structured data?
6. Concurrency
Some vendors impose limitations on concurrency, allowing users to submit only one query at a time and setting a cap on the amount of concurrent queries per account. Moreover, even without limitations, concurrency can lead to performance degradation, which brings me back to my point about elasticity. It’s important to easily be able to add resources on demand to support business growth without compromising performance.
- What’s the size of your department? How many users will be querying the data simultaneously?
- Will one user need to run multiple queries at the same time?
- What are your future needs? What is the effort entailed in increasing the amount of concurrent queries as the business grows?
7. Data Granularity
A common way to bypass performance constraints is pre-aggregation. But how many times have you prepared a report thinking you only needed it at the category level, just to discover a few months later that someone needs it at the product level?
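The trap is easy to demonstrate: once a report is pre-aggregated to category level, the product-level detail is gone and cannot be recovered from the aggregate. A small sketch with made-up sales rows:

```python
# Sketch of the granularity trap: pre-aggregating discards detail.
# The sales rows are made-up example data.

from collections import defaultdict

sales = [
    {"category": "shoes", "product": "sneaker", "revenue": 120},
    {"category": "shoes", "product": "boot", "revenue": 80},
    {"category": "shirts", "product": "tee", "revenue": 40},
]

# Pre-aggregate to category level for the report.
by_category = defaultdict(int)
for row in sales:
    by_category[row["category"]] += row["revenue"]

print(dict(by_category))  # {'shoes': 200, 'shirts': 40}
# Months later, "revenue per product" needs the raw rows again --
# it cannot be reconstructed from by_category.
```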
The issue with choosing the right levels of granularity is that detailed data can be too voluminous. That’s why granularity is another reason not to compromise on the first factor - choose a platform that supports high performance at scale, without a heavy cost tradeoff.
Will you have to sacrifice granularity to achieve performance?
8. Deployment Options
Needless to say, you must be able to deploy the data warehouse you select on the cloud you're using. Some data warehouses are deployed exclusively on AWS, GCP or Azure, while others offer multi-cloud deployments.
Is your current cloud platform lacking features that you can find in other cloud platforms?
9. Ecosystem Integrations
A healthy partner ecosystem is important for smooth integration with the tools you already use. Typically, data warehouses provided by cloud vendors will have the most extensive integrations with that vendor's other tools. Seamless integration with your BI tools, ingestion frameworks and data lake will substantially shorten your time to market.
Which tools are included in your stack and how easily do they integrate with the evaluated cloud data warehouse?
10. Data Freshness
Some use cases inherently require real-time analytics, such as fraud prevention, predictive maintenance and operational dashboards. But as a rule of thumb, for any use case, the fresher the data, the more accurate the analytics. A batch report from yesterday is not as valuable as analytics that are continuously up to date. This is key to improving any service.
Does the evaluated cloud data warehouse support continuous ingestion of data?