There are many versions of the tale of Medusa. In one, Athena turns the beautiful Medusa into the many-headed Gorgon after she has an affair with Poseiden.
There are also many stories of companies creating a many-headed, slow, federated Gorgon out of their data infrastructure. In modern-speak, that means they tried to federate everything.
Federated query engines are great for the infrequent “one-off”, drill-down, or self-service analytics use cases where you can’t have the data in a single place and second-level performance isn’t important. As federated query engines, Athena is a great choice for getting started with analytics, Redshift Spectrum is great for Redshift customers, and Presto is for everything else, including large data sets and higher user concurrency.
But choosing between the best federated query engine and a data warehouse - learn why an optimized data warehouse like Firebolt is always the better technology.Data warehouses are also adding more external data support, so if you only need some federated queries, you may be able to get away with just using a data warehouse.
I decided to break this blog into two parts because there are two decisions you need to make:
- First, when do you choose a data warehouse vs a federated query engine? That’s this blog.
- Second, once you do decide you need a federated query engine, which one is right for you? That’s the next blog: Athena vs. Redshift Spectrum vs. Presto - How to choose the best federated query engine for each use case
Let’s first dive deeper into data warehouses vs federated query engines.
Optimized data warehouses like Firebolt always outperform federated query engines
Federated query engines like Athena and Presto can never be as fast or as scalable as a well-tuned data warehouse with dedicated resources like Firebolt.
Notice I said well-tuned data warehouses. If you have looked at benchmarks closely, such as the FiveTran benchmark, you may have noticed that Presto seems to perform nearly as well as Snowflake, Redshift and BigQuery. The developers around Presto have done a great job. It is one of the many reasons Presto is so popular. But it also points to the fact that Snowflake, Redshift and BigQuery are not actually optimized for performance.
Firebolt has optimized for performance. It has shown 10x or faster performance in most benchmarks compared to other data warehouses and federated query engines. Firebolt has been 10x-182x faster than Presto and Athena in various benchmarks. One company found Firebolt was 48x-6000x faster than Snowflake across its queries … at 1/10th of the cost.
You may wonder why Snowflake, Redshift and others have not optimized their performance. I personally think there are two reasons. First, they have been focused on moving reporting and dashboard workloads to the cloud. Caching helps with both these workloads. Redshift, Snowflake and Presto all cache. Second, I don’t think the competition pushed them into it. Now that there is competition, as you will see below, it would take a rewrite.
Why optimized data warehouses are faster
Why does an optimized data warehouse deliver faster performance than a federated query engine? The main answer is that federated query engines don’t manage their own data. If you want to be fast, especially for any ad hoc or interactive analytics, you have to control data ingestion, storage, and access.
Think of a federated query engine as the compute in a completely decoupled architecture with the compute on one side, the storage on the other, and the network in the middle. There are huge benefits you get from a decoupled architecture. But it makes the network the biggest bottleneck. Many corporate networks are 10 gigabits/sec (10Gbps) which means you can move roughly 1 Gigabyte (GB) per second if you’re lucky. AWS supports 10, 25, and 100Gbps network speeds. But even 10GB per second means large queries can take seconds or more just to fetch data over the network.
In order to minimize the cost of data access during the query, you have to minimize the data you move. There are only a few ways to do this:
- Storage: sort, compress, and index data so that you can move only the data needed, and move it in as few “batches” as possible from S3.
- Caching: Cache data locally from previous retrieval.
- Pre-fetching: retrieve data before the query as much as possible.
- Query optimization: implement optimization that does as much filtering as possible to minimize data access, in addition to optimizing the query performance overall.
All of this requires controlling the data from ingestion to query. Here’s how it works with Firebolt.
Any node in a Firebolt engine (cluster) can ingest and write to a table. For each table when you create it you add a line that declares its sparse index. It’s a composite primary key consisting of any columns that would be used in queries.
Data is written as a new segment, whether it’s 1 or 1 million rows, and is non-blocking. The sparse index is updated to point to the new data, and other indexes are updated, with each write. That includes indexes for tables, joins, materialized views, and aggregations. Over time, segments are combined to optimize performance, but it is done in the background.
Firebolt has another trick. Any engine can be used for any combination of ingestion and queries. You can query external data, process, and ingest using 100% SQL. During ingestion, all data can be cached and used in queries.
Other data warehouses rewrite entire columnar partitions during each incremental ingestion, in part because they do not support indexing. For example, in Snowflake, a pruning index only keeps track of the min and max range of data in each micro-partition. As data is ingested, it’s written into existing micro-partitions, which are really columns compressed and stored in files. Since you cannot just write a single row across separate compressed files (one for each column), you need to rewrite all the files together. Because each micro-partition is 50MB-150MB, that takes time. It’s one reason you cannot ingest into Snowflake faster than 1 minute (micro-batch) intervals.
Federated query engines don’t manage ingestion at all, which means they cannot cache during ingestion by default. They also cannot easily support indexing. Keeping data consistent and indexes consistent across multiple data sources is nearly impossible. That’s a big problem. There are some cases where eventual consistency is OK. Analytics is not one of them. If you think the data might be inconsistent, you won’t trust your results.
But companies have managed to reproduce the level of ingestion and compression done by non-optimized data warehouses outside of Presto, with some work. It is also possible to prune segments if done right.
For each query, the Firebolt query optimizer uses the various indexes, including the sparse, join, and aggregating indexes, and query plan re-ordering to minimize query times. This includes filtering down data sets not in the cache before accessing them to minimize data movement.
Somewhat surprisingly, most other cloud data warehouses and query engines do not do extensive query optimization. Presto at least supports some pushdown queries to source databases. But they are not able to do as much optimization to avoid large joins. The net result is that you see roughly the same (order-of-magnitude) performance across most cloud data warehouses and Presto.
Firebolt is similar to Snowflake, BigQuery and RA3 nodes with Redshift in caching data following first-time access. But Firebolt has a major advantage over other data warehouses and federated query engines in that it can cache data during ingestion. This can dramatically reduce the amount of data needed from remote storage.
Data warehouses and federated query engines do extensive caching. Some even do result-set caching to store query results, which works really well for repetitive query workloads. But these caches are not consistent with the data. That’s a big problem.
If you want to understand why, look at BI caches. Some BI tools have done federated queries for a long time. They extract data from the source, and then cache and optimize the data for the specific tool. Rather than end up with stale data, these BI tools try to rely on heuristics to update “deltas”. But without guaranteed consistency, they often have to invalidate their cache and revert back to a query against the original remote data.
This is exactly the same problem with sidecar caches like Redis. They do not capture every write as it happens compared to a read/write-through cache like Apache Geode, Gigaspaces, Hazelcast or Apache Ignite. A sidecar cache, which is managed by its application, gets a copy later from the application, when data is accessed. So the application never knows if the data is the latest copy.
The more you invest in advanced heuristics, in getting alerts as data changes, the more you are basically building a data pipeline the hard way. It will always be easier to just build a flexible data pipeline and let a really good data warehouse do the ingestion and optimization together.
Sparse indexing allows Firebolt to directly access data within segments without having to pull the entire segment over the network. This can reduce the amount of data moved 10x or more even after all the other query optimization and caching.
Here is the main point why truly optimized data warehouses win. Other data warehouses would need to rewrite ingestion, query optimization, and data access, as well as add indexing. Federated query engines cannot do any of these things because they cannot change ingestion, by design.
As federated query engines, Athena is a great choice for getting started with analytics, Redshift Spectrum is great for Redshift customers, and Presto is for everything else, including large data sets and higher user concurrency.
Data warehouses and/or federated query engines
If you take away just a few facts about choosing a data warehouse and/or a federated query engine, remember these.
- You can have the need for both, but in general, one does not replace the other well.
- Federated query engines are great for the infrequent “one-off”, drill-down, or self-service analytics use cases. Consider Athena for simplicity and getting started, Redshift Spectrum as a Redshift customer, and Presto for big data scale and higher concurrency.
- An optimized data warehouse is always the better technology for ad hoc, interactive, high-performance, big data, or any analytics with a large number of users.
- Data warehouses are adding more external data support, so if you need a data warehouse with “a side of federated queries” a data warehouse may be enough.
For more on federated query engines, read part 2: Athena vs. Redshift Spectrum vs. Presto - how to choose the best federated query engine for each use case.
Redshift, BigQuery, Firebolt and Snowflake all have external table support that will continue to expand. The reason is the market will push them towards self service analytics. The best approach for data management is a data pipeline with analytics engines that can support federated queries and move them back into the pipeline quickly to improve performance. This was the promise of self-service analytics that started a decade ago, and it is possible.
In the meantime, if you need a federated query engine, first lay out your criteria where you would use a data warehouse vs federated query engine. Ideally you also layer your architecture so that you can easily move queries between federated engines and data warehouses the way Informatica did.
Consider the following when you compare and contrast for each use case:
- Data ingestion: Data warehouses have a huge advantage over federated query engines because they are in the path of the write. Determine how fast you can ingest new data into a data warehouse or data lake. With Firebolt, for example, data engineers use SQL to write and deploy new ELT in hours, and then can spin up new clusters that work just on that data completely isolated from other production engines. If you can deliver on a fast enough SLA, you may not need federation.
- Data sizes: Federated query engines are limited with large data sets, especially Athena, and can fail with joining two large fact tables. Understand your data first to make sure federation will even work. Presto with pushdown will be the best answer for such federated use cases.
- Data access: The biggest bottleneck with Big Data is the network. Federated query engines often have to pull all the data over the network before performing operations like joins. In the big data world, that is bad. Pruning indexes make a difference in limiting the number of partitions fetched over the network. But you’re still pulling partitions, which is inefficient. Measure the delay from data access and see if it matters. And see if you can do any pushdown or other techniques to reduce network utilization.
- Optimization: whether you use materialized views or indexes for aggregations, joins or other pre-calculations to accelerate performance, they all depend on the data being up to date. Understand how your federated query engine performs the optimizations and compare with aggregation, materialized views and indexing in other products. If it’s just caching, that won’t help for ad hoc and first-time queries. You may need to benchmark to understand the difference, but the testing will be more than worth it.
Athena, Redshift Spectrum, and Presto have a clear place in analytics. For more, read Athena vs. Redshift Spectrum vs. Presto - how to choose the best federated query engine for each use case.
But if you let them get used without any control against a data lake, you can easily end up having to chop the federated compute off the storage and starting over. So make sure you prevent your Medusa.