As a solution architect who has worked on various Big Data platforms and projects, I find it interesting to see the progression of technologies in the cloud world. The lambda architecture approach taken by Druid is innovative, and very cool. In addition, when you see the fast response times for aggregations, Druid really does add unique value to existing analytics environments. However, I always felt that Druid implementations were hardly straightforward. To get the most out of it, I had to understand the various elements of Druid: broker, historical, overlord, coordinator, middle manager, and more. I always wondered if I would be as effective at running analytics on Druid if I were a newbie. There is this Apache Druid muscle I would have to build. Could it be easier?
Having spent time with Firebolt, I see that low-latency querying can be delivered without the need to scale various parts of a specialized Druid infrastructure. Firebolt approaches the problem from a SaaS perspective. The Firebolt offering itself is about ‘sub-second analytics as a service’, where the complexities that I saw in Druid are eliminated. All I need to know is how many scale-out engines I need, and that is it. These engines are instantiated from various Amazon EC2 instances with a mix of vCPU, memory and local storage. Pretty simple and straightforward.
On another note, Druid does not leverage decoupled compute and storage the way most modern analytics solutions do. While Druid uses deep storage for persistence, it does not use this data for queries. Due to the reliance on memory and internal storage, most Druid clusters I have seen needed dozens of compute nodes. The results are there, but the question is whether you can reduce the infrastructure size and get better price-performance and a smaller environmental footprint.
Decoupled compute and storage is a key design element in Firebolt, allowing independent scaling of compute and unlimited storage. The building blocks from a Firebolt perspective are compute nodes and object storage in the form of S3. With S3’s scalability and scalable compute, both parts of the infrastructure scale independently of each other. This separation of storage and compute also makes it easy to turn Dev/Test/QA environments on and off against the same set of data. Pretty convenient.
As far as data ingestion is concerned, Druid has done a great job of integrating streaming sources and batch processing. The data gets ingested into two separate piles within Druid and is then unified seamlessly at query time. Good stuff. Firebolt does not take this approach, opting instead for 30-second micro-batch ingestion, with all data ingested into a native columnar format that is sorted, compressed and indexed. If you need real-time analysis at a finer grain than 30 seconds, Druid might be the option. This is an area that Firebolt is working to optimize and improve on.
BTW, I felt that Druid’s ingestion spec was not a very friendly way to incorporate data sources. While it enables ingestion to be defined as code, it is much more complicated than the “insert into my_fact_tables select * from an external source” approach that Firebolt takes.
On top of that, if you need to address semi-structured data, nested JSON needs to be flattened prior to loading into Druid. In contrast, Firebolt provides Lambda expressions, extensive JSON manipulation functions and the ability to store data natively in arrays. This range of options absolutely helps when working with JSON data.
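To give a feel for the flattening step mentioned above, here is a minimal Python sketch of what such a pre-load transformation might look like. This is only an illustration: the `flatten` helper, the dotted column names and the sample event are all hypothetical, and it is not how either Druid or Firebolt handles JSON internally.

```python
from typing import Any


def flatten(record: dict[str, Any], parent: str = "", sep: str = ".") -> dict[str, Any]:
    """Flatten nested dicts into a single level with dotted column names."""
    flat: dict[str, Any] = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the column prefix along.
            flat.update(flatten(value, name, sep))
        else:
            # Scalars and lists are kept as-is; a real pipeline would still
            # have to decide how to explode or serialize arrays.
            flat[name] = value
    return flat


# Hypothetical nested event, as it might arrive from an upstream source.
event = {"user": {"id": 7, "geo": {"country": "DE"}}, "amount": 12.5, "tags": ["a", "b"]}
flat_event = flatten(event)
```

Note that the `tags` array survives untouched here, which is exactly the kind of value that still needs a decision before it can land in a flat schema.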
Once the data is ingested, the biggest bang for the buck in Druid or Firebolt comes from the ability to effectively prune or aggregate data. At query time, if you are constantly generating full table scans, it will be hard, and will only get harder, to deliver fast response times. So pruning techniques like partitioning truly come in handy. But then, partitioning needs to be defined and managed. IMO, Druid makes it absolutely simple to partition data: partitioning is based on timestamp. So if I have time series data, getting partitioning out of the box without having to think about it is great. The challenge is that it does not give me the flexibility to partition on other columns. When you need that flexibility, you use secondary partitions. But while a secondary partition can be specified, queries that filter on the secondary partition column alone still require scanning all of the time series data.
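The difference between filtering on the time partition and filtering on another column can be sketched in a few lines of Python. This is a toy model under my own assumptions (one in-memory "segment" per day, a hypothetical `scan` function), not Druid's actual segment layout, but it shows why a time filter prunes while a filter on a different column does not.

```python
from collections import defaultdict
from datetime import date

# Each row: (event_day, store_id, amount). Rows are grouped into one
# segment per day, mimicking time-based primary partitioning.
rows = [
    (date(2022, 1, 1), 10, 5.0),
    (date(2022, 1, 1), 20, 7.0),
    (date(2022, 1, 2), 10, 3.0),
    (date(2022, 1, 3), 20, 4.0),
]

segments = defaultdict(list)
for day, store_id, amount in rows:
    segments[day].append((store_id, amount))


def scan(day_filter=None, store_filter=None):
    """Return (total, segments_scanned) for a filtered SUM(amount)."""
    total, scanned = 0.0, 0
    for day, seg_rows in segments.items():
        if day_filter is not None and day != day_filter:
            continue  # pruned via the time partition; its rows are never read
        scanned += 1
        for store_id, amount in seg_rows:
            if store_filter is None or store_id == store_filter:
                total += amount
    return total, scanned


by_day = scan(day_filter=date(2022, 1, 2))  # touches a single segment
by_store = scan(store_filter=10)            # has to touch every segment
```

With a time filter only one of the three segments is read; with a store filter all three are, even though only two rows match.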
Firebolt takes a different approach. Let's take a look at the DDL below:
Interestingly this data can be partitioned using any one of the following partition definitions.
There are two elements here. For one, the primary index clause effectively prunes data across store_id and product_id using sparse indexes. For another, the flexible partitioning allows for effective pruning even when you aren't querying time series data.
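For readers who have not worked with sparse indexes, the general idea can be sketched as follows. This is a generic illustration in Python, assuming rows sorted on (store_id, product_id) and split into fixed-size blocks; the block size, data and `lookup` function are made up for the example and say nothing about Firebolt's actual storage format.

```python
from bisect import bisect_right

BLOCK_SIZE = 3  # rows per block; tiny on purpose

# Rows sorted by the composite key (store_id, product_id), plus an amount.
rows = sorted([
    (1, 100, 2.0), (1, 101, 1.0), (1, 102, 4.0),
    (2, 100, 3.0), (2, 105, 6.0), (3, 100, 1.5),
    (3, 101, 2.5), (4, 100, 9.0), (4, 103, 0.5),
])
blocks = [rows[i:i + BLOCK_SIZE] for i in range(0, len(rows), BLOCK_SIZE)]

# The sparse index keeps only the first key of each block, not every row.
sparse_index = [block[0][:2] for block in blocks]


def lookup(store_id):
    """Sum amounts for one store, reading only blocks that can contain it."""
    lo_key = (store_id, float("-inf"))
    hi_key = (store_id, float("inf"))
    # The last block whose first key is <= lo_key may still hold matching rows.
    first = max(bisect_right(sparse_index, lo_key) - 1, 0)
    # Blocks whose first key is greater than hi_key cannot hold matching rows.
    last = bisect_right(sparse_index, hi_key)
    total = sum(a for blk in blocks[first:last] for s, _, a in blk if s == store_id)
    return total, last - first  # (sum of amounts, number of blocks read)
```

Because the index holds one entry per block rather than one per row, it stays small, and a lookup for a single store reads only the one or two blocks that can contain it instead of the whole table.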
Speaking of aggregations, one of the coolest features of Apache Druid is the ability to “roll up” data at ingestion. Firebolt’s aggregating indexes are similarly effective.
At a high level, Druid roll-up takes high-frequency event data and pre-aggregates it to a coarser granularity, which has a huge impact on performance. But the trade-off is that you lose the ability to view the event-level data. So it’s a pretty common design pattern to have multiple data sources at different granularities.
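As a rough sketch of what roll-up does conceptually, the Python below pre-aggregates a handful of events to hourly granularity. The event data, the hourly grain and the `roll_up` helper are assumptions for illustration only; Druid configures this through its ingestion spec rather than through code like this.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical event-level data: (timestamp, store, amount).
events = [
    (datetime(2022, 1, 1, 9, 5), "store_a", 2.0),
    (datetime(2022, 1, 1, 9, 40), "store_a", 3.0),
    (datetime(2022, 1, 1, 9, 59), "store_b", 1.0),
    (datetime(2022, 1, 1, 10, 1), "store_a", 4.0),
]


def roll_up(events):
    """Pre-aggregate events to (hour, store) with a row count and a sum."""
    rolled = defaultdict(lambda: {"count": 0, "amount": 0.0})
    for ts, store, amount in events:
        # Truncate the timestamp to the hour; finer detail is discarded.
        hour = ts.replace(minute=0, second=0, microsecond=0)
        bucket = rolled[(hour, store)]
        bucket["count"] += 1
        bucket["amount"] += amount
    return dict(rolled)


hourly = roll_up(events)
```

Four events collapse into three rows, and the two 9 o'clock events for store_a can no longer be told apart, which is the trade-off described above.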
In other data warehouses, this same principle applies, with event data being aggregated to summary tables, materialized views, or a combination of both.
Here’s the solution from a Firebolt perspective:
You can have a single table with MULTIPLE aggregating indexes. These aggregating indexes can be at multiple granularities, all while maintaining the event-level data. The end user doesn’t even need to know about the indexes, as the query planner automatically rewrites queries to use them where they are available.
All of the performance benefits, none of the complexity.
Druid roll-ups summarize data based on an ingestion spec. To address the lack of granularity in roll-ups, multiple independent roll-ups need to be defined and maintained. Firebolt’s aggregating indexes are defined once and incrementally auto-synced at ingest, so data can be queried at index speed immediately, all with online access to the raw data.
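Conceptually, keeping aggregates in sync at ingest while retaining the raw rows looks something like the toy Python class below. To be clear, this is my own simplified model: the `FactTable` class, its method names and the two granularities are hypothetical and do not reflect Firebolt's implementation or its query planner.

```python
from collections import defaultdict


class FactTable:
    """Toy table that keeps raw rows plus pre-aggregated sums per key."""

    def __init__(self):
        self.rows = []                              # event-level data stays queryable
        self.sum_by_store = defaultdict(float)      # coarse granularity
        self.sum_by_store_day = defaultdict(float)  # finer granularity

    def ingest(self, day, store_id, amount):
        self.rows.append((day, store_id, amount))
        # Both aggregates are updated incrementally in the same step as the insert.
        self.sum_by_store[store_id] += amount
        self.sum_by_store_day[(store_id, day)] += amount

    def total_for_store(self, store_id):
        # Answered from the pre-aggregated sums, with no scan over self.rows.
        return self.sum_by_store.get(store_id, 0.0)

    def total_for_store_scan(self, store_id):
        # Same answer computed from the raw rows, for comparison.
        return sum(a for _, s, a in self.rows if s == store_id)
```

A quick usage example: after `t = FactTable()` and a few `t.ingest("2022-01-01", 10, 5.0)` style calls, `t.total_for_store(10)` and `t.total_for_store_scan(10)` return the same value, but only the second one has to walk every row.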
I have always been a SQL fan. While Druid supports SQL, not all features available in the native query language are supported in Druid SQL. On top of that, if you need native joins, you might find that they are not one of Druid’s core strengths. The general approach I adopted was to use denormalized data, which simplifies the query process. In contrast, Firebolt has doubled down on SQL, with a SQL IDE for queries and scaling. Additionally, Firebolt can be orchestrated through JDBC, the API or SDKs, or through a growing set of modern integrations including dbt, Airflow, Superset, Cube and others.
Fundamentally, Druid is a great offering, but it is complex and has a fairly steep learning curve. Firebolt provides an alternative to Druid, delivering fast response times, high concurrency, and the convenience of a SaaS data warehouse. If you are a Druid practitioner, you will most likely be pleasantly surprised by what Firebolt offers, just like I was.