Have you ever totally overrun your monthly budget for an analytics environment overnight? My personal experience was a courtside seat watching someone's $21K monthly budget get consumed overnight. Here are a few thoughts on how we prepare ourselves for what lies ahead, in the public cloud and in the economy. In this post, we look at factors to consider when building a data warehouse. Our goal is to point out the potholes you are most likely to hit from a cost perspective and what you can do to avoid them. From a Firebolt perspective, we definitely have options to help prevent a costly blow-up.
Where does one start? Here is a quick checklist to use as you evaluate your choice of data warehouse platform.
- Does the platform give you control over resources: CPU, memory, I/O, storage?
- How does the platform leverage decoupled storage and compute?
- Does the platform scale up, scale out, scale in or scale to zero?
- Is the scaling factor an incremental add or a multiplier?
- Does the platform leverage technologies such as columnar compression, data pruning, specialized indexes and vectorized processing?
- What are the cost implications of delivering high-concurrency, sub-second analytics?
Now let’s take a closer look. Step back for a second and think about your data processing pipeline. If you are building a pipeline that ingests, transforms, stores, analyzes and visualizes data, and runs on demand, which parts can hurt your wallet the most?
As a suggestion, start by focusing on the big-ticket items. Every cloud bill has a different mix of line items, but the ones that sting the most are compute, storage and databases.
Compute costs rule the bill
Cloud service providers have figured out creative ways to slice, dice and deliver compute in various shapes and sizes, such as virtual machines, containers, functions and serverless offerings. Matching the workload to the right type of compute offering keeps you cost-efficient.
Let's focus on virtual machines. Granular instance choices give you flexibility from a price-performance standpoint. Network, I/O and ephemeral storage capabilities also matter and are often forgotten. You might end up with a smaller or a larger instance depending on how your workload stretches each of these dimensions. Selecting the right instance type based on detailed workload requirements is critical, especially for analytical workloads, but that is only possible if the scaling model allows for it. Let's take a look at a couple of approaches to scaling and pricing.
A comparison of two scaling models, one with fixed scaling and the other with granular scaling choices, is shown below. The fixed scaling model shown here provisions compute nodes in multiples of 2, without any control over CPU, RAM or storage resources. The only control is over the number of nodes, and each step doubles it. As a result, if an 8-node cluster does not meet your requirements, the only option is a 16-node cluster. What if you only need a little more memory than the 16-node cluster provides? Your only option is a 32-node cluster. If you have already committed to this model, the best option is to focus on right-sizing and auto scaling to deliver results.
On the other hand, the granular scaling model shown below provides multiple combinations of compute, which gives better cost control. For example, you can adjust CPU, memory or storage incrementally instead of increasing the node count as the only way to control sizing. As shown below, multiple instance choices are available per node count. Additionally, granular control over cluster size allows you to address performance incrementally.
While overprovisioning is still possible in the granular model, architects can make those tough price-performance trade-offs much more effectively, based on actual resource utilization.
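To make the difference concrete, here is a minimal sketch of the two scaling models. It assumes a flat, hypothetical price per node-hour and simplifies the granular model to adding one node at a time (the real granular model also varies instance types). The function names and numbers are illustrative and not taken from any vendor's price list.

```python
import math

# Hypothetical price, for illustration only (not a vendor figure).
PRICE_PER_NODE_HOUR = 2.00


def fixed_model_nodes(required_nodes: int) -> int:
    """Fixed model: the node count can only double (1, 2, 4, 8, 16, 32, ...)."""
    nodes = 1
    while nodes < required_nodes:
        nodes *= 2
    return nodes


def granular_model_nodes(required_nodes: int, step: int = 1) -> int:
    """Granular model (simplified): the node count grows in increments of `step`."""
    return math.ceil(required_nodes / step) * step


def hourly_cost(nodes: int) -> float:
    """Cost per hour at the assumed flat node price."""
    return nodes * PRICE_PER_NODE_HOUR


# Assumed workload: needs slightly more capacity than a 16-node cluster.
required = 18
fixed = fixed_model_nodes(required)        # doubling forces a 32-node cluster
granular = granular_model_nodes(required)  # 18 nodes are enough
print(f"fixed: {fixed} nodes, ${hourly_cost(fixed):.2f}/hour")
print(f"granular: {granular} nodes, ${hourly_cost(granular):.2f}/hour")
```

Under these assumptions the fixed model pays for 14 nodes the workload never asked for. The exact gap depends on where your requirement falls between two doubling steps.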
What about serverless?
The above models are based on virtual machines or instances. Do these problems go away if the data warehouse is built as a serverless offering? Not really. Serverless models may not have you selecting instances and instance counts; however, they typically rely on an abstract provisioning unit. For example, if you were configuring Google Cloud BigQuery, the provisioning units are called “slots”. Essentially, a slot is an abstract combination of vCPU, memory and other elements that the provider uses to scale. General guidance is that a slot is roughly equivalent to 0.5 vCPU and 0.5 GB of RAM. When you provision a slot reservation, there is a minimum of 100 slots, with capacity added in 100-slot increments. This model is more granular than the fixed scaling model shown above, but it still does not provide the control that you get with a granular scaling model. In other words, there is no way to tune CPU, memory, I/O or other resources independently. This holds true for other serverless approaches, such as those in Redshift or Synapse.
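The arithmetic behind a slot reservation can be sketched as follows. The 0.5 vCPU and 0.5 GB per slot figures, the 100-slot minimum and the 100-slot increment are the rule-of-thumb numbers quoted above. The function name and the sample workload are hypothetical.

```python
import math

# Figures quoted in the text above; treat them as rules of thumb.
VCPU_PER_SLOT = 0.5
GB_RAM_PER_SLOT = 0.5
MIN_SLOTS = 100
SLOT_INCREMENT = 100


def slots_to_reserve(vcpus_needed: float, ram_gb_needed: float) -> int:
    """Slots required when CPU and memory cannot be sized independently."""
    slots_for_cpu = vcpus_needed / VCPU_PER_SLOT
    slots_for_ram = ram_gb_needed / GB_RAM_PER_SLOT
    # The hungrier dimension dictates the reservation.
    raw_slots = max(slots_for_cpu, slots_for_ram)
    # Round up to the next purchasable increment, respecting the minimum.
    rounded = math.ceil(raw_slots / SLOT_INCREMENT) * SLOT_INCREMENT
    return max(MIN_SLOTS, rounded)


# Hypothetical memory-hungry workload: 16 vCPUs and 120 GB of RAM.
slots = slots_to_reserve(vcpus_needed=16, ram_gb_needed=120)
print(f"reserve {slots} slots "
      f"(~{slots * VCPU_PER_SLOT:.0f} vCPU, ~{slots * GB_RAM_PER_SLOT:.0f} GB RAM)")
```

In this example memory drives the reservation to 300 slots, which carries roughly 150 vCPUs for a workload that needs 16. That is the kind of coupled overprovisioning an abstract unit can hide.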
Irrespective of the offering, understanding the provisioning granularity and the workload mix that a serverless offering can satisfy is critical to controlling costs, especially if the platform provides limited visibility into true utilization. Environments with CPU-intensive and memory-hungry queries could easily end up overprovisioning across both dimensions. If you are already down the serverless road, building visibility into how your “slots”, “RPUs” or “DWUs” are being devoured by your queries is a must. Watch it like a hawk.
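One way to start building that visibility is to compare the units your queries actually consume against what you have reserved. The sketch below assumes you can export periodic samples of units in use. The sample values, the reservation size and the unit price are hypothetical, and the function is not part of any vendor's API.

```python
from statistics import mean
from typing import Dict, List


def utilization_report(used_samples: List[float],
                       reserved_units: float,
                       price_per_unit_hour: float) -> Dict[str, float]:
    """Summarize how much of a reservation the workload really uses."""
    avg_used = mean(used_samples)
    peak_used = max(used_samples)
    return {
        "avg_utilization": avg_used / reserved_units,
        "peak_utilization": peak_used / reserved_units,
        # Cost of units that sat idle, on average, each hour.
        "idle_cost_per_hour": (reserved_units - avg_used) * price_per_unit_hour,
    }


# Hypothetical hourly samples of units in use against a 500-unit reservation.
samples = [120, 180, 400, 90, 210]
report = utilization_report(samples, reserved_units=500, price_per_unit_hour=0.04)
for name, value in report.items():
    print(f"{name}: {value:.2f}")
```

A low average with a high peak suggests the reservation is sized for bursts. That is the case where auto scaling or a smaller baseline is worth evaluating.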
Cloud storage is cheap, but?
Is cloud storage really cheap? The answer is that it depends on the performance profile you need and how you leverage it. What if you need a clustered file system where you pay per MB/s of throughput? What if block storage is your only option to get low response times? What if you have to index every column and end up with bloated indexes, as happens with some accelerator technologies and even some NoSQL offerings? Here is a sample cost comparison, based on object storage and block storage offerings, each with a different scaling model and performance profile.
The general trend is to use cost-effective object storage as the data lake or data warehouse storage. This provides a low $/GB rate. However, the performance profile of object storage might need some work. To address this, cloud data warehouses leverage NVMe-based solid-state drives or other faster storage types. The cost of the faster storage tier typically gets bundled with compute and allows customers to make the most of cheap object storage.
Compression is another tool that helps get the most out of storage. With compression, you read less data from the slowest link in the chain: your storage medium. Reducing this dependence on disk access, complemented by NVMe SSDs, helps strike a balance between price and performance. But not all pricing models charge for actual storage utilization. There are offerings in the market that charge for the uncompressed size of the data set rather than its compressed size. For example, Google BigQuery storage costs are based on uncompressed capacity, and on-demand queries are charged for the uncompressed data scanned, resulting in higher storage and on-demand scanning costs.
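The effect of the billing basis is simple arithmetic, sketched below. The $23 per TB rate is the object storage price quoted later in this post. The 100 TB data set and the 5x compression ratio are assumptions chosen for illustration, and real ratios vary widely by data.

```python
def storage_cost(logical_tb: float,
                 price_per_tb: float,
                 compression_ratio: float = 1.0,
                 billed_on_compressed: bool = True) -> float:
    """Storage bill for a data set, depending on which size the vendor bills."""
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    billed_tb = logical_tb / compression_ratio if billed_on_compressed else logical_tb
    return billed_tb * price_per_tb


LOGICAL_TB = 100.0       # assumed uncompressed size of the data set
PRICE_PER_TB = 23.0      # rate quoted in this post for object storage
COMPRESSION_RATIO = 5.0  # assumed; depends heavily on the data

compressed_bill = storage_cost(LOGICAL_TB, PRICE_PER_TB, COMPRESSION_RATIO, True)
uncompressed_bill = storage_cost(LOGICAL_TB, PRICE_PER_TB, COMPRESSION_RATIO, False)
print(f"billed on compressed size:   ${compressed_bill:,.2f}")
print(f"billed on uncompressed size: ${uncompressed_bill:,.2f}")
```

With these assumptions the same data costs five times as much when billed on uncompressed size. The same multiplier applies to any scan charge that is metered on uncompressed bytes.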
Bottom line: Take the time to understand the price-performance implications of your storage choices, the storage optimization strategies that can help, and the pricing model for storage.
Leveraging optimizations to reduce infrastructure costs
While most analytics platforms process large volumes of data, they vary in their capabilities, especially when it comes to efficient processing. Analytics platforms that use optimization techniques can help get the most out of your cloud spend.
First, aggressive data pruning through partitioning and clustering at the dataset, file and block level can help reduce the burden of sifting through all that data. It won’t make the burden go away, but it makes it palatable. In many cases, these techniques require analyzing, vacuuming and clustering datasets on a periodic basis. Depending on the vendor, you might be paying for these operations too.
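Block-level pruning can be illustrated with a small in-memory sketch. It assumes data is stored sorted in fixed-size blocks, each carrying the minimum and maximum key it contains (often called a zone map). The class and function names are hypothetical and do not reflect any particular engine's implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    min_key: int     # smallest value stored in the block
    max_key: int     # largest value stored in the block
    rows: List[int]  # the block's data


def build_blocks(sorted_values: List[int], block_size: int) -> List[Block]:
    """Split sorted (clustered) data into blocks and record min/max per block."""
    blocks = []
    for i in range(0, len(sorted_values), block_size):
        chunk = sorted_values[i:i + block_size]
        blocks.append(Block(chunk[0], chunk[-1], chunk))
    return blocks


def pruned_scan(blocks: List[Block], lo: int, hi: int) -> Tuple[List[int], int]:
    """Read only blocks whose [min_key, max_key] range overlaps [lo, hi]."""
    candidates = [b for b in blocks if b.max_key >= lo and b.min_key <= hi]
    matches = [v for b in candidates for v in b.rows if lo <= v <= hi]
    return matches, len(candidates)


blocks = build_blocks(list(range(1000)), block_size=100)
matches, blocks_read = pruned_scan(blocks, lo=250, hi=260)
print(f"read {blocks_read} of {len(blocks)} blocks, {len(matches)} matching rows")
```

Pruning only works this well because the data is clustered on the filter key. On unsorted data most blocks would overlap the range, which is why platforms ask you to recluster periodically.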
Second, indexes introduce additional overhead when it comes to big data, but if you combine block-level data pruning on clustered data with columnar compression, the gains can be enormous. Think efficient data pruning with specialized indexes as you start refactoring workloads. While this might sound like something every platform offers, it might be surprising to find out that is not the case.
Finally, another inefficiency we all deal with daily is in the reporting layer. Some reports are run and re-run all day long. These reports could be aggregations that consume CPU cycles and possibly defeat any result set caching. What if a materialized view with automatic query rewrite could reduce the impact of repeatedly scanning data by serving pre-calculated aggregates instead? This could translate to fewer compute nodes, or cheaper nodes with less RAM, resulting in lower costs.
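The idea behind serving pre-calculated aggregates can be shown with an in-memory analogy. This is not how a database implements materialized views or query rewrite. It only contrasts rescanning raw rows on every report run with looking up a total that was computed once. The data and function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Tuple[str, float]  # (day, amount)


def report_from_raw(rows: List[Row], day: str) -> float:
    """Every report run scans all rows again."""
    return sum(amount for d, amount in rows if d == day)


def build_daily_totals(rows: List[Row]) -> Dict[str, float]:
    """Computed once, like refreshing a materialized aggregate."""
    totals: Dict[str, float] = defaultdict(float)
    for day, amount in rows:
        totals[day] += amount
    return dict(totals)


def report_from_aggregate(totals: Dict[str, float], day: str) -> float:
    """Each report run is a single lookup instead of a scan."""
    return totals.get(day, 0.0)


rows = [("2024-01-01", 10.0), ("2024-01-01", 5.0), ("2024-01-02", 7.5)]
totals = build_daily_totals(rows)

# Both paths return the same answer; only the work per run differs.
assert report_from_raw(rows, "2024-01-01") == report_from_aggregate(totals, "2024-01-01")
print(report_from_aggregate(totals, "2024-01-01"))
```

The trade-off is that the aggregate has to be kept in sync with the underlying data, so the savings are largest for reports that run far more often than the data changes.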
Here is an example of optimization technologies streamlining data access. Leveraging purpose-built indexes reduces the amount of data scanned to a fraction, resulting in the fastest data experiences you can deliver.
Processing data efficiently can reduce infrastructure consumption, improve customer experience and cut calls to the help desk, all of which help control your spend. Clustering, analyzing and vacuuming consume people and infrastructure resources from a day-two operations standpoint. Picking optimizations that do not require constant babysitting is a great way to manage costs without adding a whole lot of day-two operations burden.
Cost of delivering high concurrency and sub-second response times
One common pitfall in evaluating data warehouse technologies is that concurrency and response times typically take a back seat to features and functionality. If a single data warehouse cluster does not address your concurrency needs, the only option is to use multiple clusters. Auto scaling in the form of multi-cluster warehouses can relieve the operational burden. However, the variability in performance during cache warm-up and the additional costs can leave a bad aftertaste.
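A back-of-the-envelope sketch shows how concurrency turns into cluster count and cost. It assumes each cluster sustains a fixed number of concurrent queries and has a flat hourly price. Both numbers are hypothetical and would need to come from your own testing and price list.

```python
import math


def clusters_needed(peak_concurrent_queries: int, queries_per_cluster: int) -> int:
    """Clusters required to serve the peak without queuing."""
    if queries_per_cluster <= 0:
        raise ValueError("queries_per_cluster must be positive")
    return max(1, math.ceil(peak_concurrent_queries / queries_per_cluster))


def hourly_cost(clusters: int, price_per_cluster_hour: float) -> float:
    """Cost per hour while all clusters are running."""
    return clusters * price_per_cluster_hour


# Assumed: 45 concurrent queries at peak, 8 per cluster, $16 per cluster-hour.
clusters = clusters_needed(peak_concurrent_queries=45, queries_per_cluster=8)
print(f"{clusters} clusters, ${hourly_cost(clusters, 16.0):.2f}/hour at peak")
```

Doubling the queries a single cluster can sustain roughly halves the cluster count, which is why per-node efficiency matters as much as the scaling mechanism.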
Sub-second response time is a common requirement that most mainstream data warehouses struggle with. This has led to the rise of technologies like Druid, Pinot, Kylin and ClickHouse. However, these technologies come with their own set of challenges in terms of operational complexity, specialized skills and infrastructure sprawl. The point is that sub-second latency has a high cost that can no longer be neglected.
Architecting for high-concurrency, low-latency queries up front will optimize infrastructure and operational costs in the long run. Purpose-built solutions might be required alongside your existing data warehouse platform, or a data warehouse that supports these types of use cases might be the right one to start with.
The skills challenge
In the big data world, specialized skills are in high demand, and the labor costs for those skills are nothing to disregard. How do employers navigate the cost of specialized skills and certifications? If you are focused on controlling your spend, you should focus on developer productivity to address the shortage of skills. Building a team that is able to focus on REST APIs, SDKs and SQL is a great place to start. This provides continuity within the organization, as these skills enable better integration without the need for proprietary tools.
First, customers are dealing with “credit fever” and “slot sprawl” with some of the cloud data warehouse offerings. Understanding the provisioning, scaling and pricing models is important if you want to avoid the aftershock of cost overruns.
Second, price-performance cannot be an afterthought. Our suggestion is to take the price-performance trade-offs head on. Understand your workload profile, look for granular scaling options to fine-tune spend, leverage technologies that implement intelligent optimizations and build for the end user.
Third, concurrency and end user response times matter. If you think otherwise, you will find yourself at the crossroads of poor performance, complexity and cost. Plan ahead for these requirements.
Finally, technologies that rely on common skills such as SQL are easier to staff and support than a proprietary stack or platform.
How Firebolt helps
At Firebolt, we focus on innovation and infrastructure efficiency to provide best-in-class performance while optimizing your TCO. This starts with giving customers a choice of instance types and node counts so they can scale granularly without the burden of infrastructure management. As an example, a large customer was able to reduce their spend by 10% simply by switching to a right-sized instance, without sacrificing performance. Additionally, if each node is highly efficient, you can make do with a smaller cluster. Another customer found this out when they went from 8 nodes on a popular cloud data warehouse down to a single Firebolt node, exceeding their concurrency needs and reducing their spend by over 50%.
On the storage front, pricing is a transparent $23/TB, the same as the price of object storage. Object storage capacity is optimized through columnar compression. Data access is optimized through an efficient file format and specialized indexes that access granular data ranges, enabling high-concurrency, sub-second analytics.
All this is available without the need for specialized skills. You can focus on automation through Firebolt APIs, SQL and SDKs, thus enhancing developer productivity.