How to choose the best analytics engine for each type of analytics
There are so many engines to support analytics that figuring out the best technology can be really confusing.
There are data warehouses for general-purpose analytics, and data lakes to help improve scale and access to data. There are also all kinds of specialized analytics databases that deliver better performance for very specific types of analytics. The list includes:
- Timeseries databases like InfluxDB and TimeScaleDB
- Hybrid timeseries databases Apache Druid, Clickhouse, and Apache Pinot
- Search-based technologies such as Elastic also support analytics
You might think that choosing one data warehouse will work. It hasn’t so far. Most of the great data-driven companies I’ve seen support their interactive and real-time operational and customer-facing analytics with (at least) one data lake, one data warehouse, and sometimes several specialized engines for operational and customer-facing analytics.
So how should you choose the right engine(s)?
You can make the right choice for each new project by answering three questions:
- What are your performance requirements?
- What data do you need?
- What types of analytics do you need to perform?
If you don’t need query performance of a few seconds or less, you could probably go with almost any data warehouse. Otherwise if you do need seconds or less, and are willing to give up the general-purpose analytics of a SQL-based data warehouse for a specific use case, you should consider a specialized database.
More recently, it turns out you don’t need to sacrifice analytics or SQL for performance. You can have the simplicity and analytics flexibility of Snowflake with the speed of Clickhouse or Druid. Firebolt is the first to prove it.
Let’s go through each choice in more detail.
Self service analytics enabled by data lakes
If you are going to deliver new analytics faster, you need a data lake. A data lake is a data storage layer (not compute) that makes all your data more easily accessible. When done right, it enables data engineers to perform ELT using SQL or their tool of choice to load any new data into their engine of choice for any analytics workload.
In other words, a data lake is great for those situations where you do not know the data or the analytics you need in advance.
Why do we need a data lake in addition to a data warehouse? The main reason is because most data warehouse deployments have had two big challenges.
First, it takes too long to deliver new analytics using ETL and a data warehouse. It often takes weeks to gather requirements, weeks for a separate ETL team to find and load the right data, and then 1-3 months to finally get the right results. Of course, the business usually wants the new analytics yesterday.
A data lake, SQL-based ELT data pipeline, data warehouse, and DataOps-driven team takes this time down to days or hours. The data lake gives data engineers direct access to any new data they need instead of having to rely on a data integration team.
What you give up to get the flexibility of a data lake is performance. When you don’t optimize your data for a particular type of analytics, you cannot expect performance.
You can either implement a data lake as just storage or combine storage and compute. If you rely on one or a few compute engines to access data, it can limit your company’s access to the data. This doesn’t mean that you shouldn’t provide an engine for access. I do not consider Presto to be restrictive, since it works with any data lake format. If you implement a storage-only data lake, you should probably assume Presto will be one of those engines and consider providing it as a service anyone can use.
Several “data lakehouse” vendors such as Databricks and Dremio do couple compute and storage. The benefits you get from it, including performance, additional features, and resources may be well-suited for your specific needs. But the performance improvements in lakehouses still come mostly from caching, which only works well with reporting and other “repetitive” workloads where you use the same data multiple times. They are not not as well suited for interactive or ad hoc analytics that need new data from storage. Lakehouses can also restrict access and cost more because they require you to access the data through their engine.
A lakehouse may be good-enough performance for certain use cases, and you should consider them. Databricks is great for ETL and data processing in general, and Delta Lake makes it even better. Just make sure you choose a data lake architecture that ensures all your downstream analytics will be able to consume your data. That includes data warehouses.
High performance analytic apps with analytics databases
If you have a data warehouse, why do you need an analytics database?
That brings us to the second big challenge with data warehouses: most are too slow to support interactive analytics at today’s data scale, especially with big data. No matter what improvements were made in the past, the growth in big data continued to overwhelm most deployments.
In response, a host of specialized analytics database startups emerged to deliver performance for specific types of data and analytics. This included timeseries databases like InfluxDB and TimescaleDB, as well as hybrid timeseries databases like Druid, Clickhouse, and Pinot. Elastic and other search databases also emerged for search and related analytics against unstructured data.
The hybrid timeseries database vendors focused on making performance great for specific types of “ad hoc” analytics such as group by and filter operations, not just time-oriented aggregations and metrics. Most implemented what was basically a single wide table with specialized indexes. They believed they needed to sacrifice general-purpose analytics to get sub-second performance. The most obvious example is joins. While Clickhouse and Druid do support fact-to-dimension table joins, they are not used for analytics as much as limiting changes in data to the dimension tables so that the fact table does not need to be re-ingested.
All of these databases require you to perform extensive ETL to structure the data as a single wide fact table, or a single merged message. Most manage their own ingestion process, performing compression, indexing, and other optimizations as they load each file or message into the table.
Once you adopt a specialized database, by the time you get to version 2.0 of a project you should see the big challenge. With an analytics database, you need to know in advance what your data is and what you’re looking for. The moment you need new data, or more detailed data, you often need to rerun your ETL and reingest all your data.
Analytics databases also do not decouple storage and compute. It means queries and workloads are not easily isolated. As a result, most deployments end up putting in the same rigid controls we saw with traditional data warehouses to prevent rogue queries or new data from impacting others. That eventually cripples the ability to add new analytics, which can stifle innovation and prevent new insights.
In short, analytics databases are a solid option for specific types of analytic apps where the data and the analytics do not change (much.) But they are not good for general-purpose analytics, ad hoc analytics, or for any kind of use case such as root cause analysis or troubleshooting new issues that may need new data and ad hoc combined.
Search - I still haven’t found what I’m looking for
Every time I see an Elastic cluster with analytics, I think about that U2 song “I still haven’t found what I’m looking for.” Elastic is great for search. But it can become really expensive to implement analytics with Elastic. Elastic’s analytics is not built on a columnar store. It needs to build indexes to support each type of query. In other words, Elastic will work for you when you do not know the structure of the data but you do know the structure of the analytics. Once you find something you want to watch for and have all the data in Elastic, it can work. But if you build a lot of indexes on Elastic, it can end up being a huge cost and take a long time to load. I remember seeing one deployment that took 10 hours to load 250GB of data into Elastic, mostly because of the time needed to build all the indexes.
There are definitely some types of monitoring and other analytics that make sense with Elastic. But before your analytics get too big on Elastic, make sure to evaluate your alternatives.
General-purpose analytics with data warehouses
Data warehouses were created to let anyone perform any analytics against structured, high quality data. In other words, you know the data in advance, but not the analytics. Decoupled storage and compute means a modern cloud data warehouse can now support almost any scale and allow any new analytics and new insights without disrupting others.
There is no alternative to a data warehouse for general-purpose analytics if you need performance. A data lake cannot replace a data warehouse if performance and simplicity matter. Nor can a data warehouse replace a data lake because data lakes give data engineers self-service analytics. You should have both.
But now there is an alternative to specialized databases. Data warehouses with decoupled storage and compute are increasingly able to support high performance analytics even better than today’s specialized databases. This should not be too surprising. There is no reason a great SQL query optimization and execution can’t be as good if it implements indexing and pre-compute. It’s just that existing data warehouse vendors didn’t do the work. Decoupled storage and compute actually reduces the amount of data used, because each query only pulls the data it needs. It doesn’t have to scan the entire table each time, and doesn’t have to share resources.
Firebolt has shown that you can have the speed of Clickhouse or Druid with the simplicity and flexibility of Snowflake, and about 1/10th the cost of Snowflake, from gigabyte to petabyte scale. Many of our customers already had a data warehouse and databases like Druid, and decided to add Firebolt.
Make future choices easier
Choosing the right analytics engine is getting easier, in part because data warehouses are getting faster and can cover more high performance analytics use cases. But you can make it even easier on yourself with two simple rules.
First, make sure you reach out internally to see what other analytics engines are being used, both to learn from them and leverage their expertise. You’ll hopefully end up with fewer technologies and a good pool of internal resources.
Second, look ahead before you leap. Build out your 1-3 year analytics roadmap before you choose another analytics engine. Think about your data growth, how your analytics need to change in each project, and your analytics needs for your future projects. You may find that the broader analytics capabilities, flexibility, and multi-workload support you get with a high-performance data warehouse better suits your broader needs than choosing yet another specialized analytics database one project at a time.