How to choose the best federated query engine for each use case
If you are reading this blog, you hopefully have already figured out that you need a federated query engine. If you’re not sure, you can read my previous blog: Choosing between the best federated query engine and a data warehouse. Otherwise, read on.
Federated query engines are great for the infrequent “one-off”, drill-down, or self-service analytics use cases where you can’t have all the data in a single place and you don’t need query response times of a few seconds or less. If you need to scale, Presto is great.
There are some simple rules of thumb you can use to choose the best engine for you.
- Athena is a great choice for getting started with analytics if you have nothing set up yet.
- Redshift Spectrum is great for Redshift customers.
- Presto is for everything else, including large data sets, more regular analytics, and higher user concurrency.
But to make a wise decision, you should first know your use cases, your customer requirements including SLAs, and your data sets and queries.
Federated query engine use cases
Several uses of federated query engines have emerged over the last few decades. Each federated analytics technology was purpose-built for specific use cases, so it’s important to understand them when choosing the right engine.
Federated queries were first used to simplify data access for application integration and Web services. Eventually products like Composite emerged to let applications query across multiple sources without having to move the data or write a lot of code. Application integration and Web services depended on this type of architecture to simplify development.
They were also used for drill-down operations in business intelligence tools, to let you “drill back” from a report to the original source and raw data. I remember using this in the late 90s within an ad hoc query tool.
This is a valid use case. It’s just not what Presto, Athena or Dremio were designed for. Since they are not part of the ETL process, you would have to build each drilldown manually and expose it as a service you could call from within a BI tool.
About a decade ago, some BI and data integration vendors, including Informatica, which is where I was at the time, realized that an analyst could get around the data warehouse and query multiple data sources on their own to quickly build a new report. You could then pass the metadata over to the data warehouse team, who could convert the federated query into an ETL job. This was the birth of self-service analytics. The idea later spread into related areas such as data wrangling.
Self-service analytics is an ideal use case. There is always a need to create new reports that merge data in the warehouse with some new data. Performance is not as important as getting the report done within hours or a few days. Waiting weeks for a Spark job to be written and run, or for a data warehouse change to be tested, is unacceptable.
If you implement Presto for self-service analytics, make sure your analysts can do the work on their own, or at the very least that there is an SLA of hours for delivering new data. Otherwise your analysts will go around your query engine and use the #1 engine and BI tool that has been the leader for over 30 years: the spreadsheet.
Federated analytics mostly started with BI tools building federated queries within the thick clients of the 90s and 2000s. Because those clients couldn’t scale as data exploded, and some departmental BI tools started to become enterprise-wide, some BI vendors began creating server-based versions.
In 2012, a few years after self-service analytics started, Facebook built Presto to execute distributed queries at scale across Hadoop and other data sources. They contributed it to open source in 2013. Since then, Presto has become the federated query engine of choice for directly querying data lakes. Amazon Athena, which was released at the end of 2016, is built on Presto. Redshift Spectrum, which is basically Athena inside a Redshift VPC, was released in April 2017. Others like Dremio have emerged as well.
So many vendors rely on federated queries and caching that you need to know how to identify this in a product, because it brings limitations. There is an easy test: no matter how fast they perform once they have the data, how fast are they at accessing the data the first time?
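One way to run that test is to time the same query twice and compare the first-run (cold) latency against the repeat (warm) latency. Below is a minimal sketch; the `run_query` callable is a hypothetical stand-in for whatever client your engine provides, and the fake client just simulates a cache so the harness can be demonstrated:

```python
import time

def cold_vs_warm(run_query, sql):
    """Time the same query twice; a large gap suggests the engine is
    serving the second run from a cache rather than the source."""
    start = time.perf_counter()
    run_query(sql)                 # first run: engine must fetch from the source
    cold = time.perf_counter() - start

    start = time.perf_counter()
    run_query(sql)                 # second run: engine may serve from its cache
    warm = time.perf_counter() - start
    return cold, warm

# Fake client that caches results after the first call (for illustration only)
_cache = {}
def fake_run_query(sql):
    if sql not in _cache:
        time.sleep(0.05)           # simulate a slow first fetch from the source
        _cache[sql] = "rows"
    return _cache[sql]

cold, warm = cold_vs_warm(fake_run_query, "SELECT count(*) FROM t")
# A cold/warm ratio far above 1 means you're benchmarking the cache, not the engine
```

Run the same comparison against a real engine with a real table, and you will quickly see whether its headline numbers depend on having already pulled the data.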
If you are really good at managing federated data, like Facebook, and have several massive data sources you need to manage separately, then you should consider Presto. It will be a lot less expensive at that scale, but you will need to manage and tune it extensively to get that performance.
Athena will not deliver the same predictable performance, and it will not scale. If you don’t believe me, look around for the “exhausted resources” error messages from Athena, or for the advice to order joins with the largest table first. Any shared resource has to limit each customer for the sake of every other customer.
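That join-ordering advice comes straight from Athena’s own tuning guidance: put the largest table on the left side of the join so the smaller one can be distributed to workers. As an illustration (the table names and sizes are hypothetical):

```sql
-- Largest table on the left; the smaller table on the right gets distributed
SELECT c.region, count(*) AS events
FROM clickstream_events e        -- billions of rows (hypothetical)
JOIN customers c                 -- millions of rows (hypothetical)
  ON e.customer_id = c.id
GROUP BY c.region;
```

A well-tuned warehouse optimizer would reorder this for you; needing to do it by hand is a sign of the resource limits you are working within.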
Infrequent, or on-demand “one-off” analytics
Google BigQuery (which isn’t a federated query engine but has some external table support) introduced a new concept to analytics: serverless computing. It was so popular that Amazon released Athena in 2016, and then Redshift Spectrum, to better address the “infrequent analytics” segment of the market.
Shared serverless analytics offerings allowed you, for the first time, to pay as you go either by the amount of data scanned (Athena/Redshift Spectrum) or for the compute (BigQuery) without the need to have a cluster running to get a fast response. The vendor provisions the compute and pulls the data dynamically from different sources.
If you are just getting started with some analytics, don’t plan on running these analytics frequently, and aren’t as concerned about performance, Athena and BigQuery are great. Just make sure you know their limitations.
- They are priced for infrequent usage. If you use them regularly as a data warehouse, you will quickly spend a fortune and either want to buy slots on BigQuery or move to a data warehouse on Amazon.
- Every account is throttled so that it can’t starve other accounts of shared resources. Read through their limits carefully, and look beyond the official docs, which explain only some of them. Athena doesn’t allow more than 20 concurrent users, for example. You can look through the data warehouse comparison guide as well. Also be aware that table sizes are limited: one company couldn’t get beyond 5 billion rows with Athena.
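The pricing point is easy to see with a back-of-the-envelope calculation using the $5 per TB scanned rate mentioned above. The workload numbers below are made up for illustration:

```python
ATHENA_PRICE_PER_TB = 5.00           # USD per TB scanned (Athena / Spectrum list price)

def monthly_scan_cost(tb_per_query, queries_per_day, days=30):
    """Back-of-the-envelope cost of a recurring workload under pay-per-scan pricing."""
    return tb_per_query * queries_per_day * days * ATHENA_PRICE_PER_TB

# Occasional "one-off" use: 0.1 TB scanned, one query every other day
occasional = monthly_scan_cost(0.1, queries_per_day=0.5)    # about $7.50/month

# The same engine used as a daily warehouse: dashboards re-scanning 1 TB, 200 queries/day
warehouse = monthly_scan_cost(1.0, queries_per_day=200)     # about $30,000/month
```

The per-scan model is wonderful for the first workload and ruinous for the second, which is exactly why frequent usage pushes you toward slots or a warehouse.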
In these cases, you may be looking at Redshift Spectrum as an existing Redshift user, or Presto. We’ll cover the differences later.
Athena vs. Redshift Spectrum vs. Presto
Once you realize you need a federated query engine, either in addition to or separate from a data warehouse, when should you use Athena vs. Redshift Spectrum vs. Presto?
I think there are a few simple rules.
When to use Athena
If you’re starting from scratch, you should consider Athena. It’s basically serverless Presto as a service, without the headache of setting up and managing your own cluster. Just point it at your data and get started for $5 per terabyte scanned.
There are several limitations, though. It is a shared, multi-tenant service, which means every user has to be throttled to protect every other user. With Athena, you’re allowed at most 20 concurrent users. The other limits aren’t always clear because they depend on what resources happen to be available. That means you cannot guarantee good performance, and you certainly will not get sub-second performance at any reasonable scale.
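Because queries beyond the concurrency limit are rejected rather than queued, it is common to throttle on the client side. Here is a minimal sketch using a semaphore to cap in-flight queries; the limit constant and the `submit_query` stub are placeholders, not Athena API calls:

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

ATHENA_CONCURRENCY_LIMIT = 20          # assumed per-account default; check your quotas
_slots = threading.BoundedSemaphore(ATHENA_CONCURRENCY_LIMIT)

def run_throttled(submit_query, sql):
    """Block until a slot is free, so we never exceed the service-side limit."""
    with _slots:
        return submit_query(sql)

# Stubbed query function that records how many queries run at once
_lock = threading.Lock()
in_flight = 0
peak = 0

def submit_query(sql):
    global in_flight, peak
    with _lock:
        in_flight += 1
        peak = max(peak, in_flight)
    time.sleep(0.001)                  # pretend the query takes a moment
    with _lock:
        in_flight -= 1
    return f"results for {sql!r}"

with ThreadPoolExecutor(max_workers=50) as pool:   # 50 callers, at most 20 in flight
    futures = [pool.submit(run_throttled, submit_query, f"SELECT {i}") for i in range(50)]
    results = [f.result() for f in futures]
```

The semaphore keeps `peak` at or below the limit no matter how many callers pile up, which is the behavior you want in front of any throttled shared service.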
Scalability is also an issue. As I mentioned before, one Athena customer could not get past tables or join results greater than 5 billion rows. That may seem very large, but it’s not: a 1TB data set holds 5 billion rows at around 200 bytes per row.
When to use Redshift Spectrum
If you are a Redshift user, I think the decision is easy. Try Redshift Spectrum. It runs in the same VPC as your Redshift cluster, and Amazon will use up to 10x the number of Redshift nodes to support it. You may even get more scale than you could on Athena. The price is the same, $5 per TB scanned.
If you don’t have Redshift, you can’t use it. Adding Redshift just for this use case does not make much sense either.
When to use Presto
If you need scale beyond the first two options, better control over tuning and hardware, or your own cluster separate from Redshift and anything else, Presto is your answer. It will require other infrastructure, like Hive for metadata. You will also need to build up your own experts who know how to optimize it. But if you need federated queries at scale and are willing to invest to get the best performance, scalability, and price-performance at that scale, Presto is a great option.
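That Hive dependency is just a metastore for table metadata, and wiring Presto to one is a single catalog properties file. A minimal sketch, assuming a metastore at the hypothetical host below (older Presto releases use the `hive-hadoop2` connector name; newer Presto and Trino releases use `hive`):

```properties
# etc/catalog/hive.properties -- metastore host is hypothetical
connector.name=hive-hadoop2
hive.metastore.uri=thrift://metastore.example.com:9083
```

Each additional data source gets its own catalog file like this one, which is what makes the same cluster a federated engine across all of them.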
The future of federated query engines
Some data warehouse vendors may tell you that federated query engines are not the best option for analytics, or that they will disappear, for a host of reasons. I do not think that’s the case, for the same reason that Excel never disappeared as the #1 BI tool. Even though an integrated, governed data pipeline used by everyone may be the ideal, there will always be federated use cases where people need to bring multiple data sources together and federation is the only way.
Ideally, you should enable both the data warehouse and a federated query engine approach to support more analytics. The whole reason federated query engines exist, just like Excel, is to enable more self-service and more analytics. Provide the access points for federated query engines, and provide pipeline-based alternatives the moment it makes sense to change. Once people find analytics that should be done more regularly, supporting both gives them a path towards implementing those analytics in a pipeline and data warehouse. That has always been one of the goals behind self-service analytics.