In a recent LinkedIn post I shared the results of a POC we did at Firebolt, where we ran queries that had previously run on Snowflake. The results were great - huge performance gains on smaller, cheaper clusters. The truth is I didn't put much thought into the post - I shared it pretty spontaneously after being happy to see such great results.
What took us by surprise is how much attention, in views and comments, the post got. Conclusion: mention Snowflake = get attention :-)
In the post comments I got a lot of requests for more information on the queries we ran. I answered a couple, saying that it wasn't a big dataset (~0.5TB) and that the queries were rather simple. One commenter was skeptical that a simple query could be slow on Snowflake. So I decided to expand on the topic and share more information on the actual queries.
Part 1 - Why even simple queries can be slow in cloud data warehouses
Unlike in the on-premises world, modern cloud data engines have a much more complicated relationship with storage. In the cloud we enjoy infinite storage, which is great. But S3's infinite storage layer (I'll use AWS lingo throughout this post, but the same concepts and challenges apply to all cloud providers) is far from optimal for performance.
When results aren't already sitting in the cache and the query engine has to scan data in S3, that's when you typically go get coffee. Queries involving large data scans over S3 are too slow for interactive analytics. This is something most Athena/Presto users know very well: since these query engines scan data directly in S3, they run into slowdowns very often once data volumes become significant.
All cloud data warehouses have various techniques for storing data in a smarter way and serving it to the query engine faster. This is why cloud data warehouses are typically faster than Athena/Presto for non-cached queries.
But at the end of the day, at scale, even modern cloud data warehouses move so much data between S3 and the compute nodes' SSDs that performance suffers.
Here's one of the Snowflake queries (fields/values are masked of course):
It's a simple query. No joins, only one big table. The other queries weren't conceptually very different, so let's stick with this example. The query essentially aggregates over a few dimensions and returns sums computed with a combination of CASE statements, filtered by a few of the dimensions.
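The actual query can't be shared verbatim, but a hypothetical query of the same shape - aggregation over a few dimensions, sums built from CASE statements, filters on some of those dimensions - would look something like this (all names and values are made up):

```sql
-- Hypothetical reconstruction; the real fields/values are masked.
SELECT
    event_date,
    event_type,
    account_id,
    SUM(CASE WHEN status = 'completed' THEN amount ELSE 0 END) AS completed_amount,
    SUM(CASE WHEN status = 'pending'   THEN amount ELSE 0 END) AS pending_amount
FROM big_events_table
WHERE event_date BETWEEN '2021-01-01' AND '2021-01-31'
  AND event_type = 'purchase'
  AND account_id IN ('acct_a', 'acct_b')
GROUP BY 1, 2, 3;
```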
But the filters are where it actually gets interesting. Because the values the data is filtered by can be spread randomly across the physical files in S3, finding them requires lengthy data scans. Moreover, the smallest chunk of data Snowflake can move from S3 to the SSD is a full micropartition, which is 50MB-150MB in size. Each micropartition typically contains much more data than the filters really need, so it gets sliced only after reaching the SSD. A filter that ultimately touches only a few MB of rows can still force the engine to fetch GBs worth of micropartitions - precious time is wasted on data access, only for most of those bytes to be discarded by the query filters.
I hope this conveys why, even at 0.5TB, you can already see simple queries taking tens of seconds to run on a "Large" warehouse.
Part 2 - How Firebolt uses indexing to prune data and stay fast at scale
So how do you overcome the slowness of moving data from S3 to the higher memory tiers? The answer is to move less data, and that's where indexing and optimized storage come into play.
At Firebolt we rely on our proprietary storage format, called F3 ("TripleF"). When data is ingested into Firebolt, it is automatically converted and stored in F3. What's unique about F3 files is that they're sorted, compressed and indexed. Tables in Firebolt are defined with a PRIMARY INDEX, which determines the physical sort order of the F3 files, and an index is created that works in a tightly coupled way with those files. When queries come in, the index is used for pruning: it tells the query exactly which ranges of data to pull from F3 - only the ones that actually participate in the query. So unlike engines that transfer big files from S3 and slice them afterwards, F3 saves a significant amount of bandwidth, because only the data the queries actually need gets moved.
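Firebolt's exact index format isn't public, but the pruning mechanic itself is easy to sketch: when data is physically sorted by the index keys, a sparse index recording the first key of each storage block lets the engine skip every block whose key range can't overlap the filter. Here's a toy Python sketch of that idea (the block structure and names are made up for illustration, not Firebolt's actual F3 layout):

```python
import bisect

# Toy sparse index over sorted, blocked data: one entry per block,
# recording the first sort-key value stored in that block.
# (Illustrative only - not Firebolt's actual F3 format.)
blocks = [
    {"first_key": "2021-01-01", "rows": ["..."]},
    {"first_key": "2021-01-08", "rows": ["..."]},
    {"first_key": "2021-01-15", "rows": ["..."]},
    {"first_key": "2021-01-22", "rows": ["..."]},
]
first_keys = [b["first_key"] for b in blocks]

def blocks_to_scan(lo, hi):
    """Return indices of blocks whose key range may overlap [lo, hi]."""
    # First candidate: the block starting at or just before lo.
    start = max(bisect.bisect_right(first_keys, lo) - 1, 0)
    # Last candidate: the last block starting at or before hi (exclusive end).
    end = bisect.bisect_right(first_keys, hi)
    return list(range(start, end))

# A filter on 2021-01-10..2021-01-16 touches only blocks 1 and 2;
# blocks 0 and 3 are pruned without ever being fetched from S3.
print(blocks_to_scan("2021-01-10", "2021-01-16"))  # → [1, 2]
```

The point of the sketch: pruning happens *before* any data moves, so the cost of a filtered query scales with the data that matches the filter, not with the table size.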
So let's look at how we defined the PRIMARY INDEX in Firebolt:
Primary indexes in Firebolt are most effective when they match the fields most commonly used to slice the data. In this POC those were the "date", "type" and "account" fields, which is why the primary index is so effective.
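In Firebolt, the primary index is declared as part of the table DDL. The real definition is masked, but with hypothetical field names, a table covering those three fields might look roughly like:

```sql
-- Hypothetical DDL; the real table and field names are masked.
CREATE FACT TABLE events (
    event_date  DATE,
    event_type  TEXT,
    account_id  TEXT,
    status      TEXT,
    amount      DOUBLE
)
PRIMARY INDEX event_date, event_type, account_id;
```

The column order in the PRIMARY INDEX clause matters: it determines the physical sort order of the F3 files, so filters on a prefix of these columns prune the most data.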
Conclusion - Your cloud data warehouse won't stay fast for long, but fear not!
The one thing we can probably safely predict is that we'll all have much more data to analyze as time goes by. As your data continues to grow, you'll find that you can't escape performance challenges forever, because the existing approaches to storage and query engines in the cloud aren't fast enough for analytics at scale.
Firebolt's optimized storage and indexing approach relies on efficient data pruning that can dramatically increase query performance, without the need for more compute firepower.
Learn more about Firebolt's POC results in our upcoming webinar:
Firebolt POCs Explained - What Makes Firebolt Outperform Snowflake, Athena and Others. Register here!