Almost all cloud data warehouse vendors claim to be the fastest. The thing is, seamless, high performance experiences tend to come with a trade off that no one likes to talk about - costs. The process of understanding cloud data warehouse costs is not straightforward, as they are dependent on different parameters including the data, queries, user concurrency and the performance requirements. In the following article we’ll review the most common pricing models, their pros and cons and which use-cases they are most relevant for.
The ‘Pay Per TB Scanned’ model includes storage and executed query costs. This means that pricing heavily depends on usage and the size of your workload.
For storage, BigQuery charges $0.02 per GB. In Athena, data is stored in S3 and users pay more or less the same. On top of that, in both platforms users pay $5 for each TB scanned while executing queries.
Per TB scanned billing means users really do only pay for what they use at the moment, and not for any idle time. If your analytics are more “one-off” and not regular daily tasks, this model is a great option.
If you are just getting started, Athena can be a great “getting started” option for running against an existing data lake, and is also super user-friendly, which can save costs spent on resources and developers. That is of course, until your needs grow.
This pricing model can make users very cautious before running a query, because the query might cost too much. Almost every organization has encountered the same painful scenario - a user makes the grave mistake of running a query over a massive data set, which results in a big fat dollar amount for that single query. This usually leads to throttling that prevents people from asking “expensive” questions, or requires them to ask before they run large queries. But throttling defeats one of the main goals of the analytics infrastructure; to make analytics more accessible.
Bottom line, while this pricing model can work very well for infrequent analytics against small-mid sized data sets, it can become the most expensive option for repetitive big-data analytics where a lot of data is scanned frequently. If you have lots of data and users performing analytics, this may be the worst model for you.
In the case of BigQuery, you can move to a flat monthly rate of $10,000 (or $8,500/month if billed annually) for every 500 slots, which are basically compute units. This brings BigQuery down towards Snowflake and Redshift prices for regular-use analytics.
Both Snowflake and Redshift offer on-demand pricing models as well as 30%-70% discounts for prepaying. The cost of these platforms depends on how much you use them, performance requirements and dataset sizes.
Redshift charges per-hour per-node for both compute and storage (coupled together). To calculate costs, multiply the price per hour for the selected node by the cluster size and number of utilized hours - [price per hour] x [cluster size] x [utilized hours]. Keep in mind, your cluster can be formed only using the same type of nodes, which makes it hard to adjust to dynamic use-cases.
Snowflake decouples storage and compute, and charges separately for them. Compute pricing has 4 different tiers - Standard ($2), Enterprise ($3), Mission-Critical ($4) and Virtual Private Snowflake (undisclosed but more expensive) - per node. You can choose from 8 different virtual warehouse (cluster) sizes for compute - 1, 2, 4, 8, 16, 32, 64, and 128 nodes. Doubling the warehouse size doubles the cost. The largest in this case is $512 per hour, $4 x 128 nodes for Mission-Critical Edition, which is around $4.5 Million annually 24x7. The actual charging is billed per second.
Storage pricing is $23 per terabyte per month with an annual commitment, or $40 per terabyte per month on-demand. Even though we chose to put Snowflake and Redshift under the same category, there are critical differences between them. While a benchmark might show that Redshift is cheaper than Snowflake, Snowflake is easier to manage. In addition, because Redshift couples storage and compute, you can end up paying for more for Redshift just to store larger data sets. A true cost comparison needs to take into account your true data size, compute, and administrative costs.
When working with relatively small workloads and when usage is manageable, this flexible pricing model can be very cost effective. By monitoring your performance and usage needs, you can turn compute resources on and off to control costs. Another advantage associated with Snowflake is its ease-of-use. While Snowflake is not considered a cheap platform, it provides a friendly user experience which in the long run can save costs spent on manpower and resources.
Working with large data sets, complex queries or high user concurrency brings out all the deficiencies of most cloud data warehouses. While most cloud data warehouses help simplify administration and scalability, they are not good at improving performance or costs as you scale. The simple answer to improving performance has been to scale compute, which makes more money for the vendor. There aren’t enough options to tune for greater efficiency, or choose different node types and numbers for different workloads.
In the case of Snowflake, for example, you can only keep doubling your virtual warehouse size to scale for increased query complexity, or add more identical virtual warehouses to scale for increased user and query concurrency. This has been referred to as “credit fever”.
This pricing model addresses the conflict of interests described in the previous section. With optimized storage and indexing, Firebolt has made it its main priority to enable higher performance with fewer compute resources, while still maintaining the convenient pay per use model.
Users can select engines from a variety of EC2 instance types optimized for different kinds of workloads, scale up or down from 1-128 nodes on the fly.
See the results of a recent POC vs Snowflake, in which the customer ran 5 queries over a 0.5 TB data set:
Firebolt provides the most efficient pay per use pricing model, which enables users to reach higher performance with less resources. Pricing is completely transparent - Users can always see the rate of the resources they use directly in the user interface and they don’t have to be certified accountants to understand it.
The new approach is especially useful for:
So what’s the right platform and pricing model for you? Key points for consideration:
On a personal note, why should costs stand in the way of an amazing insight? A good solution encourages you to run more queries, on more data, by more users to get more value, and all that without the cost trade off.