The data warehousing market has gone absolutely mad over performance. Why is this the case? Why does performance continue to matter so much? And why is Firebolt betting so much on performance? It seems so boring! After all, the current technological landscape is far more about ease of use, elasticity, and collaboration than raw performance.
In this post we argue that performance is actually not about performance at all! We’ll put real-world customer needs for data warehouse performance in context, and we’ll even make a bold prediction about the future of data warehousing (preview - it’s all about the NewCDW).
About Those Performance Wars…
We’ve witnessed quite a few performance-related hog-wrestling matches recently. BigQuery vs Redshift in 2016. Synapse vs the world in 2019. Databricks vs Snowflake in 2021. Hey, even Hightouch vs Census in 2022 for good measure! It’s all posturing and chest-thumping. Even we at Firebolt have gotten carried away sometimes, although we strive to be as factual and honest as possible.
In a vacuum, standard benchmark performance doesn’t matter, and neither does its snobbier cousin price-performance:
- Standard benchmarks poorly represent customers’ own workloads.
- As some folks demonstrated, opportunities for behind-the-scenes optimizations are abundant.
- Real workloads are often volatile, elastic, and erratic. It’s far more important to know how a system responds to adversity than how it performs under ideal and stable conditions.
- If you need more speed, just spend more money and throw more hardware at the problem (to an extent).
In the end, you can park a battleship in the amount of ambiguity around this topic. This is why every company under the sun is claiming superiority. In my book, to be admissible, benchmark performance differences need to be measured in orders of magnitude. Otherwise… It. Just. Doesn’t. Matter.
You are probably thinking… So if performance is not about performance, then what is it about, genius?
It’s All About The Workloads
A data warehouse can be thought of as a collection of workloads (e.g. BI, ELT, ad-hoc). However, not all workloads are created equal. Some workloads, such as data-intensive applications, are especially sensitive to performance.
Why’s that? My former employer Google demonstrated that 53% of mobile website visits are abandoned if a site takes longer than 3 seconds to load. User expectations of data applications closely resemble those of mobile websites. Thus if your data application is powered by a data warehouse, you had better hope most of your queries finish in under 3 seconds.
For data application workloads specifically, your data warehouse has to be sufficiently fast. To use a bad analogy, your data warehouse is like a child trying to ride a scary roller coaster (i.e. serve data app workloads). It must be tall enough (fast enough) to ride the data-app roller coaster. If it is not, it’ll have to settle for bobbing for apples.
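To make the 3-second rule of thumb concrete, here is a minimal Python sketch of how you might check a batch of query latencies against that budget. The function names, the choice of the 95th percentile, and the sample latencies are all assumptions made up for illustration, not measurements or anything Firebolt-specific.

```python
import math

# Budget borrowed from the 3-second mobile-site figure quoted above.
LATENCY_BUDGET_SECONDS = 3.0


def percentile(samples, pct):
    """Return the nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Nearest-rank method: the smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_budget(latencies, pct=95, budget=LATENCY_BUDGET_SECONDS):
    """True if the chosen percentile of query latencies is within the budget."""
    return percentile(latencies, pct) <= budget


# Hypothetical query latencies in seconds, invented for illustration only.
example_latencies = [0.4, 0.7, 0.9, 1.2, 1.8, 2.1, 2.4, 2.9, 3.5, 6.0]

if __name__ == "__main__":
    print("median latency:", percentile(example_latencies, 50))
    print("p95 within budget:", meets_budget(example_latencies))
```

Looking at a high percentile rather than the average is a deliberate choice here: a handful of slow queries is exactly what an impatient user notices.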
You must be this tall to ride the roller coaster
Now you understand why performance is absolutely not boring. More remarkably, I’ve thus proved myself wrong.
Performance at Firebolt
Here at Firebolt we obsess about performance precisely because we are laser-focused on helping our customers build interactive data-driven applications. How do we achieve our performance?
For one, organizing the data well is crucial to good data warehouse performance. To that end, Firebolt leverages sparse indexes, aggregate indexes, join indexes, and magic to make sure that we can find, scan, aggregate, and join the data as quickly and efficiently as possible.
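For readers who have not met a sparse index before, here is a rough Python sketch of the general idea: keep the data sorted, remember only the first key of each block, and use those few keys to skip blocks that cannot match. This is a toy illustration of the technique in general, not Firebolt’s implementation; the class name, the block size, and the sample data are all hypothetical.

```python
from bisect import bisect_left, bisect_right


class SparseIndex:
    """Toy sparse index: one index entry per block of a sorted key column."""

    def __init__(self, sorted_keys, block_size):
        if block_size <= 0:
            raise ValueError("block_size must be positive")
        self.keys = sorted_keys          # assumed to be sorted ascending
        self.block_size = block_size
        # One mark per block: the first key stored in that block.
        self.marks = sorted_keys[::block_size]

    def candidate_blocks(self, low, high):
        """Return the block numbers that may hold keys in [low, high]."""
        # The block just before the first mark >= low may still contain low.
        first = max(bisect_left(self.marks, low) - 1, 0)
        # Blocks whose first key is greater than high cannot match.
        last = bisect_right(self.marks, high)
        return range(first, last)

    def scan(self, low, high):
        """Read only the candidate blocks and filter rows inside them."""
        matches = []
        for block in self.candidate_blocks(low, high):
            start = block * self.block_size
            for key in self.keys[start:start + self.block_size]:
                if low <= key <= high:
                    matches.append(key)
        return matches


if __name__ == "__main__":
    # Hypothetical data: 50 even numbers split into 5 blocks of 10 keys.
    index = SparseIndex(list(range(0, 100, 2)), block_size=10)
    print(list(index.candidate_blocks(42, 47)))  # blocks that must be read
    print(index.scan(42, 47))                    # matching keys
```

The index stays tiny because it grows with the number of blocks, not the number of rows, which is why this style of index suits large scans over sorted data.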
That’s not enough. Firebolt got its head start in performance by standing on the shoulders of giants. Under the hood, Firebolt forked arguably the fastest runtime out there, ClickHouse, and extended its capabilities with data warehouse primitives, managed storage, managed metadata, the service layer, a bespoke distributed processing layer, and so on.
However, we’re not stopping there. We’re investing heavily in doing all kinds of fun and crazy things. 2022 will be a very exciting year!
Future of Data Warehousing - NewCDW?
So what does the future look like? Well, to know the future one must understand the past.
The data warehousing market can learn something from our OLTP friends. In the mid-2000s relational databases were hitting their limits - they simply couldn’t scale. Folks struggled with sharding and re-sharding MySQL dozens of different ways, and it was never pretty.
Today, NewSQL databases like Spanner, Aurora, Cosmos DB, Cockroach, and Yugabyte are promising the best of both worlds - the scalability of NoSQL with the features and semantics of an RDBMS.
I propose that a similar trend is underway in data warehousing. Modern cloud data warehouses are highly scalable and performant, yet struggle with sub-second workloads like data applications.
Enter what I call NoCDW. Technologies like Druid and ClickHouse are far narrower than a state-of-the-art CDW. However, the sacrifices they make are strategic bets on other strengths, strengths that translate to a superior ability to service data application workloads.
And now the hot take. Just like NewSQL emerged in OLTP, I believe we are in the early stages of NewCDW - cloud data warehouses that are versatile enough to service both classic data warehousing workloads (e.g. ELT, BI, ad-hoc) and data-intensive application workloads, with the semantics and features we’ve all come to love.
Not a single vendor is sitting idle here. Cloud data warehouses, faced with the innovator’s dilemma, are driving performance down (or is it up?). NoCDW vendors are trying to separate storage and compute, add mutations, and ACIDify their operations. Legacy vendors like Teradata and Exasol are cloudifying their on-premises offerings.
Who is going to come out on top? I myself am betting on Firebolt, but I may be biased…
If you’d like to learn more about Firebolt:
And of course we’re hiring world-class talent!