Part 1 - The 4 requirements for an ad hoc analytics architecture
Every company needs to perform ad hoc analysis. Just look around and listen to the report requests. There are plenty of ad hoc analysis examples:
- Why did sales drop?
- Why are product returns up?
- Why is this KPI low?
Any time you see a result you do not understand, especially something you're seeing for the first time, you probably need to do some ad hoc analysis. The most common ad hoc analytics tool is still Excel. That's in part because most data warehouses do not support ad hoc analysis.
What you need for ad hoc analysis
You can implement ad hoc analytics infrastructure that beats Excel, or at least complements it. But you need to mix the right technologies with the right best practices. You also first need to realize that most analytics technologies are either too slow, too inflexible, or both.
Because it's important to understand the requirements, I've split this into two blogs. This first blog explains the 4 requirements for ad hoc analytics, and why most data warehouses and analytics tools fail to meet them:
- Optimize for performance
- Add any new data at any time
- Perform any ad hoc analysis with no planning
- Run any ad hoc analysis at any time
Part 2 - the right ad hoc analytics architecture - explains how to implement the technologies and processes you need to support ad hoc analysis. As an industry we have spent over a decade trying to deliver self-service analytics, where analysts, employees or end customers can just do ad hoc analysis or build new reports. I believe we are nearly there.
You can also watch the webinar: How to survive ad hoc and interactive analytics with big data, which goes through some of the challenges and best practices.
Requirement 1. Optimize for performance
When companies tried Hadoop, they were told Hive and Presto could support ad hoc analysis. But analysts couldn't add new data on their own; that usually required an engineer. Hive and Presto are also not as fast as a data warehouse that optimizes storage and compute together.
Then came cloud data warehouses, including Amazon Athena, Redshift, Snowflake, and Google BigQuery. Many cloud data warehouses use caching and other techniques to accelerate repetitive query workloads. But they are horrible at ad hoc analysis because it takes a long time to fetch new data from remote storage. Benchmarks like the one FiveTran runs each year show that Snowflake, Redshift, Athena and Presto clusters costing $150K-300K annually ($75K for Presto) all take 8-11 seconds on average to run a host of different queries against just 1 terabyte of uncached data, what we call "first-time" queries.
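The gap between cached and first-time queries can be sketched with a toy model. Everything here is invented for illustration (the class, the simulated latency, the query strings); it only shows why a cache helps repeat workloads but does nothing for the first run.

```python
import time

# Hypothetical sketch of a warehouse-style query layer that caches results.
# The remote-fetch delay stands in for pulling uncached data from storage.
class QueryEngine:
    def __init__(self, remote_fetch_seconds=0.05):
        self.cache = {}
        self.remote_fetch_seconds = remote_fetch_seconds

    def run(self, sql):
        if sql in self.cache:
            # Repeat query: served from cache, effectively instant.
            return self.cache[sql], "cached"
        # First-time query: pays the full remote storage fetch.
        time.sleep(self.remote_fetch_seconds)
        result = f"rows for: {sql}"
        self.cache[sql] = result
        return result, "first-time"

engine = QueryEngine()
_, kind1 = engine.run("SELECT region, SUM(sales) FROM orders GROUP BY region")
_, kind2 = engine.run("SELECT region, SUM(sales) FROM orders GROUP BY region")
print(kind1, kind2)  # first-time cached
```

Ad hoc analysis is, by definition, mostly first-time queries, so it rarely benefits from this cache.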
Now, you can get faster performance. We have known for some time what it takes to improve performance beyond just columnar storage. Indexing and pre-calculations, for example, make a big difference. Specialized servers like Apache Druid have proven that you can improve query performance by denormalizing and aggregating data, and then adding specialized indexing.
The problem is most data warehouses do not add the features needed, or give you the option to tune enough to get that performance.
Whatever the technique, ad hoc analytics infrastructure needs to be 10x faster, returning results in 1 second or less (almost) every time, because during ad hoc analysis you shouldn't have to wait.
Requirement 2. Add any new data at any time
Even the fastest analytics servers fail at pure ad hoc analysis because ad hoc is more than just speed; it requires flexibility too. With Druid, for example, you first have to write ETL processes to get the data down to a single fact table before ingestion.
You cannot do ad hoc analysis with traditional ETL, because ETL takes weeks to change. It is often built and maintained by a different data engineer or group using Spark or some other specialized ETL tool. Most ad hoc analysis needs to be done yesterday; at best, analysts might settle for 1-2 days.
You also cannot do ad hoc analysis if you have a slow, controlled change process or have to control access. With a traditional shared data warehouse or analytics server deployment, every change needs to be tested to make sure it doesn't break anything for others. That can take weeks or more.
If you want to beat Excel, you need to replace ETL and data warehouse processes. Otherwise, be prepared to do lots of one-off SQL extracts for Excel.
Requirement 3. Perform any ad hoc analysis with no planning
Many experienced analysts will tell you that if you have to structure all the data and optimize queries in advance, it's not true ad hoc analysis. The moment you have to plan, prune and tune, it takes days or weeks.
There are two reasons why this takes days or weeks. The first is that analysts usually have to go to a different person or team to get the changes made. The second is that the changes usually need to be tested. Remove both conditions and the delay disappears.
Again, take Druid, which is built around a single denormalized fact table. You can't just keep all your detailed data, keep all your columns and add any metrics. You have to pick only the important columns and metrics, and prune and aggregate your data to improve performance. You can't then just add new data or optimizations on your own, because most Druid deployments are shared, and any change can impact everyone else.
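What that pre-ingestion work looks like can be sketched as follows. The tables, columns and values are all invented for the example; the point is that the join happens before ingestion and columns not chosen in advance (here, supplier) are simply gone.

```python
# Hypothetical star schema: a fact table plus a product dimension.
facts = [
    {"order_id": 1, "product_id": "p1", "amount": 20},
    {"order_id": 2, "product_id": "p2", "amount": 35},
]
products = {
    "p1": {"category": "toys", "supplier": "acme"},
    "p2": {"category": "books", "supplier": "acme"},
}

# Columns picked in advance; everything else is pruned away.
keep = ("order_id", "amount", "category")

denormalized = []
for f in facts:
    joined = {**f, **products[f["product_id"]]}        # denormalize the join
    denormalized.append({k: joined[k] for k in keep})  # prune to chosen columns

print(denormalized[0])
```

If an analyst later needs supplier, the whole pipeline has to be rebuilt and the data re-ingested, which is exactly the planning step that breaks ad hoc work.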
Most data warehouses require some massaging and optimization of the data in advance to improve performance, and most data warehouses are shared resources, so they require a lot of testing.
Requirement 4. Run any ad hoc analysis at any time
The modern elastic, auto-scaling cloud data warehouse does come at a cost, and it's a big one. A massive or accidental query can cost you a fortune and blow your monthly budget. There is a solution: you can throttle usage, and Snowflake, for example, has some decent controls for throttling. The problem is that analytics can become so expensive that you are actually preventing good analytics from being done.
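The throttling trade-off can be made concrete with a toy budget guard. This is not any vendor's actual mechanism; the class, cost model and numbers are invented to show how a credit cap that stops the accidental full scan also blocks the legitimate big ad hoc query.

```python
# Hypothetical monthly-credit guard in front of a query engine.
class BudgetGuard:
    def __init__(self, monthly_credits):
        self.remaining = monthly_credits

    def submit(self, query, estimated_credits):
        if estimated_credits > self.remaining:
            return "rejected"  # the throttle: the analysis never runs
        self.remaining -= estimated_credits
        return "ran"

guard = BudgetGuard(monthly_credits=100)
status1 = guard.submit("small ad hoc query", estimated_credits=10)
status2 = guard.submit("big exploratory join", estimated_credits=500)
print(status1, status2)  # ran rejected
```

The guard cannot tell an accidental cross join from a genuinely valuable exploration; both look like expensive queries, which is why heavy throttling usually signals an unmet analytics need.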
The right ad hoc analytics infrastructure should make it easy to execute any ad hoc analysis, not throttle it. If you are throttling a lot, you probably have an unmet analytics need. In many cases, those accidental queries are for ad hoc analysis or data science.
So what is the right architecture for ad hoc analysis?
If you understand and believe these are the right requirements for self-service ad hoc analysis, the best architecture may be clear. Over the last decade a lot of work has gone into getting data lakes right, in part to improve self-service. But it is more about how the modern data warehouse needs to support ad hoc analytics.
That is for part 2 - the right ad hoc analytics architecture.
You can also watch the webinar and ask your questions.