When to use Postgres, MySQL, in-memory databases, HTAP, or data warehouses
Over the last 5 years I’ve had many conversations with companies whose existing databases, in-memory technologies, or cloud data warehouses could not support what some call the 1 second SLA (Service Level Agreement).
Some people talk about hot (1 second SLA), warm (10s of seconds), and cold (10s of seconds to minutes) analytics. Others talk about traditional BI, internal or operational analytics, and client-facing analytics, which is basically SaaS analytics for your customers. Some operational and most client-facing analytics have an informal 1 second SLA. Why? Most customers do not like to wait. As consumers, many of us start to abandon a Web site when response times are more than 2 seconds.
The first big elephant in the room during my conversations, the one slowing down analytics, is usually a transactional database. There is no one vendor that stands out as being worse than others. There are many versions of the Postgres “elephant”, even some analytics-optimized versions like Greenplum, Netezza, ParAccel, and yes, Redshift. There have also been a few variants of MySQL, not to mention in-memory caches and other cloud data warehouses.
But almost all of my conversations seem to fit into 4 stages of performance challenges as companies have grown:
- One database for transactions and analytics
- A separate database for analytics
- A separate data warehouse for all analytics
- The 1 second SLA
I could say just skip some stages, but that’s not how applications evolve. There are valid reasons for adopting and then changing the technologies at each stage of growth. I do hope understanding these stages in advance will help you. If you recognize the stages, and the limitations of Postgres, MySQL, and even many data warehouses in supporting the 1 second SLA, perhaps you can skip a few stages and save yourself some pain. As a famous old mentor once said, “forewarned is forearmed, Grasshopper.”
1. One database for transactions and analytics
Many internal apps or SaaS offerings I’ve seen start on a single database that supports everything, from transactions to analytics. Postgres and MySQL are both widely used, and why not. They are great options for supporting any kind of transactions or other operations where you need to guarantee you never lose data. They are both proven, with broad ecosystems. They’re also free. When you’re a startup with your first customers, the data is often small. Running your first analytics on the same infrastructure - especially when you’re just getting started, want to keep your architecture simple, and are cash-strapped - makes good sense.
But at some point, usually by the time data has reached a few hundred gigabytes or a few hundred million rows in my conversations, analytics performance becomes an issue. Row-based databases are great for finding and updating single rows. They are not as good for analytics that involve extracting large data sets, or for aggregations involving GROUP BY and filter operations against large, wide fact tables. Scanning each wide row sequentially can be slow.
There are some tricks you can use when your data is still small, in the tens of gigabytes and tens of millions of rows.
- Keep your tables narrow: Don’t have many columns in the tables you’re using for analytics if your database is row-based. The fewer columns you have, the faster row-based scanning will be.
- Indexing: Try different types of indexes for lookups. They won’t always help with analytics, since indexes in relational databases are oriented around returning a small number of records. But when you are doing lookups or other operations that benefit from indexing, you should consider it.
- Materialized views: Eventually you may want to create copies of your data sorted and organized differently for your analytics queries, and materialized views are one of the most common techniques (there is a short sketch after this list). You may also pre-aggregate, for example, or denormalize.
- Pre-compute: You can pre-compute aggregates and some of the other most frequently used results. Filling a dashboard with pre-computed metrics and a simple SELECT statement is much faster than nested queries with GROUP BY and other operations.
- Rewrite queries: Data models and queries matter. A combination of a different data model and a rewritten query can dramatically improve performance. So make sure to invest in your SQL skills and in optimizing queries, or make sure your database can optimize them for you. Perhaps the most important trick is to learn the alternatives to joins, but there are many, many others. For example, in Postgres it’s sometimes recommended that you avoid common table expressions (CTEs, or WITH queries) for analytics, although Postgres version 12 did add an optimization that can inline CTEs and improve their performance.
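To make the indexing, materialized view, and pre-compute tricks concrete, here is a minimal Postgres-flavored sketch. The orders table, its columns, and the daily revenue metric are all hypothetical, just to illustrate the pattern:

```sql
-- Hypothetical transactional table (wide in practice, narrowed here)
CREATE TABLE orders (
    order_id     BIGINT PRIMARY KEY,
    customer_id  BIGINT NOT NULL,
    status       TEXT NOT NULL,
    total_amount NUMERIC(12,2) NOT NULL,
    created_at   TIMESTAMPTZ NOT NULL
);

-- An index that helps the lookups your dashboards filter on
CREATE INDEX idx_orders_customer_created
    ON orders (customer_id, created_at);

-- A materialized view that pre-aggregates a frequently used metric,
-- so the dashboard runs a simple SELECT instead of a GROUP BY over raw rows
CREATE MATERIALIZED VIEW daily_revenue AS
SELECT date_trunc('day', created_at) AS day,
       count(*)                      AS order_count,
       sum(total_amount)             AS revenue
FROM orders
GROUP BY 1;

-- Refresh on a schedule or after batch loads; CONCURRENTLY needs a unique index
CREATE UNIQUE INDEX idx_daily_revenue_day ON daily_revenue (day);
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_revenue;

-- The dashboard query is now a cheap scan of pre-computed rows
SELECT day, revenue FROM daily_revenue ORDER BY day DESC LIMIT 30;
```

The trade-off is freshness: between refreshes the dashboard reads slightly stale but fast data, which is usually fine for warm analytics.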
1.5. Is HTAP an option?
I have had several people ask about HTAP, or hybrid transactional/analytical processing, and whether it makes sense. This happened a lot more when I was working on HTAP myself.
HTAP has become a popular option over the past few years. An HTAP database, or an in-memory computing (IMC) technology like an in-memory data grid (IMDG), is designed to support both transactions and analytics.
Should you consider it? If you need to perform analytics as part of a transaction, and if having a single database for all your workloads ends up being simpler and lower cost for your project, then it can be a great solution. But if the analytics can be pre-computed and performed “before” the transaction, if you can get a fast enough response from your analytics engine, and if you can separate transactions and analytics, then you don’t need HTAP.
I say this with the utmost respect for HTAP. I do believe it is something you should evaluate, and there are some great learnings from HTAP that are slowly working their way into other technologies. But as Rumpelstiltskin will tell you, “there’s always a price.” There are two reasons why HTAP can be more expensive.
The first reason is the cost of an in-memory architecture. In my experience, when in-memory row-based storage is used for analytics, it can deliver the same speed as disk-based columnar storage, but it is a lot more expensive, and it usually requires adding indexes just to match columnar performance. Columnar storage with indexing, or better yet in-memory columnar data with vectorized processing, outperforms anything else. It also tends to be less expensive than an in-memory HTAP architecture when it is used just for analytics.
The second reason is that it is more efficient to segment workloads, especially transactions and analytics. For example, Oracle allows you to keep side-by-side row-based and columnar versions of the same table, which is pretty smart for a mixed workload. The challenge is that the data models and queries that are best for your transactions and best for your analytics are usually very different. You might start to create a second set of tables for your analytics instead of side-by-side columnar copies. But eventually you will find that scaling a single cluster to support the worst-case peaks of both workloads becomes more expensive. At some point it becomes more cost-effective to separate the various transactional and analytical workloads into smaller clusters that you can manage and scale independently.
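As a rough illustration of the side-by-side approach, here is a minimal sketch using Oracle Database In-Memory syntax. It assumes the In-Memory option is licensed and the in-memory column store has been sized; the orders table is hypothetical:

```sql
-- Reserve memory for the in-memory column store (takes effect after a restart)
ALTER SYSTEM SET inmemory_size = 4G SCOPE = SPFILE;

-- Keep the row-based table for transactions, and ask Oracle to also
-- populate a columnar, in-memory copy of it for analytics
ALTER TABLE orders INMEMORY PRIORITY HIGH;

-- Analytical queries can now be served from the columnar copy,
-- while single-row lookups and updates still use the row store
SELECT customer_id, SUM(total_amount)
FROM orders
GROUP BY customer_id;
```

The appeal is that both copies stay consistent inside one database; the price is the extra memory and a single cluster sized for the peaks of both workloads.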
2. A separate database for analytics
When all these tricks are no longer enough to keep your transactions and analytics on the same database, and HTAP is not a good fit, the only logical next step is to deploy your transactional and analytics workloads onto separate databases.
Postgres, MySQL, and other RDBMSs support replication, which helps scale for analytics. Sometimes limiting writes to the primary (the write master) and reads to the replicas can work. But by this point you are probably also changing your data model and your queries to be optimized for analytics. The two sets of schemas can end up being very different.
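To make “very different schemas” concrete, here is a hypothetical sketch: the transactional side stays normalized for fast single-row reads and writes, while the analytics side gets a denormalized, pre-joined fact table built from it. All table and column names are made up for illustration:

```sql
-- Transactional schema: normalized, optimized for small reads and writes
CREATE TABLE customers (
    customer_id BIGINT PRIMARY KEY,
    region      TEXT NOT NULL
);

CREATE TABLE orders (
    order_id     BIGINT PRIMARY KEY,
    customer_id  BIGINT NOT NULL REFERENCES customers (customer_id),
    total_amount NUMERIC(12,2) NOT NULL,
    created_at   TIMESTAMPTZ NOT NULL
);

-- Analytics schema (on the replica or a separate database):
-- denormalized and pre-joined so dashboards avoid joins at query time
CREATE TABLE fact_orders AS
SELECT o.order_id,
       o.created_at::date AS order_date,
       c.region,
       o.total_amount
FROM orders o
JOIN customers c USING (customer_id);
```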
Offloading analytics is not new. It’s been the de facto approach for decades in part because transactional systems were often overloaded even before analytics were added, and the best data models, queries, and infrastructure for transactions and analytics were always different.
So the best and oldest trick is to offload your analytics. Databases are often designed to support read-only replication. You can also build your own extraction pipeline, which is what most people do for analytics. The two biggest options are batch extracts, or replication using CDC (Change Data Capture). CDC-based replication built on the write-ahead logs might put 1-5% additional load on a database, not more, and provides you with up-to-date data. I’m excluding trigger-based extraction for a reason: it’s much more costly than CDC or batch. But you do need to make sure you are capturing all changes, including deletes, which you might lose with incremental batch extracts if you’re not careful.
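As one hedged example of log-based CDC, Postgres ships with logical replication built on the write-ahead log. A minimal sketch, assuming a hypothetical orders table that exists with the same definition on both sides:

```sql
-- On the transactional (source) database:
-- wal_level must be set to 'logical' in postgresql.conf
CREATE PUBLICATION analytics_pub FOR TABLE orders;

-- On the analytics (target) database:
-- Logical replication streams inserts, updates, AND deletes,
-- so deletes are not lost the way they can be with incremental batch extracts
CREATE SUBSCRIPTION analytics_sub
    CONNECTION 'host=source-db dbname=app user=replicator password=...'
    PUBLICATION analytics_pub;
```

Dedicated CDC tools generally work on the same write-ahead-log principle, just with more connectors and operational tooling around it.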
Once you decide to offload the data, you will either need to add your own Extract, Transform, and Load (ETL) or ELT processes to get your data into the best target format, or start to use an ETL/ELT vendor. Several support CDC-based replication. But the more important decision you will need to make before you choose your vendors is how to set up your modern data pipeline. That’s for another blog.
3. A separate data warehouse for analytics
My next set of conversations starts when companies have 1 billion or more rows, or 1 Terabyte (TB) or more of data, in a table. Looking back at these conversations, they can be divided into two camps.
In the first camp, analytics had been offloaded to a separate database (e.g. Postgres, MySQL, or even a custom cache), and these companies were running into more performance issues. They were usually evaluating different data warehouse or analytics database options. The analytics ranged from traditional BI to internal monitoring to SaaS analytics services for their customers.
The second camp had already deployed a data warehouse and had also run into performance issues. Some had even added a database or in-memory cache to store precomputed result sets for analytics. The main reason they were struggling is that they had a 1 second SLA, usually for internal operational monitoring of some sort, or because they had client-facing analytics, and as we know, customers do not like to wait.
4. The 1 second SLA
I do have one recommendation I hope you remember from this blog. If you have a 1 second SLA at Terabyte scale where people do not want to wait for queries, and you haven’t yet gone through using a row-based database or traditional data warehouse for analytics, skip those earlier steps. Choose a high-performance data warehouse or specialized columnar database that was designed to handle 1 second SLA workloads at scale.
Please note that most cloud data warehouses do not support a 1 second SLA at Terabyte scale. Look at the Fivetran benchmark, which found 8-11 second average times using the TPC-DS benchmark with 1TB of data for a single user. The benchmarks that really matter are your own. Make sure you benchmark your actual production data and queries at peak concurrency, and recreate what you think will be your future production needs, not just your current ones.
If you have a 1 second SLA, there are some options for accelerating results. You can aggregate or pre-compute the results. But if you need to deliver a 1 second SLA at Terabyte scale, you need to combine columnar storage with indexing, query optimization and advanced execution techniques like vectorized processing.
Your main options today for supporting a 1 second SLA at this scale are specialized analytics databases like ClickHouse, Druid, and Pinot, or a cloud data warehouse like Firebolt. I do believe there will be more options in the future.
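To show what “columnar storage with indexing” looks like in practice, here is a minimal ClickHouse-flavored sketch. The table, its columns, and the query are hypothetical; the point is that the sorting key doubles as a sparse primary index, and aggregations read only the compressed columns they need:

```sql
-- A columnar fact table; the ORDER BY key also acts as a sparse primary index
CREATE TABLE events
(
    event_time  DateTime,
    customer_id UInt64,
    event_type  LowCardinality(String),
    amount      Float64
)
ENGINE = MergeTree
ORDER BY (customer_id, event_time);

-- A typical client-facing query: only the referenced columns are read,
-- and the sorting key prunes everything outside one customer's range
SELECT toStartOfDay(event_time) AS day,
       count()                  AS event_count,
       sum(amount)              AS total_amount
FROM events
WHERE customer_id = 42
  AND event_time >= now() - INTERVAL 30 DAY
GROUP BY day
ORDER BY day;
```

Druid, Pinot, and high-performance cloud warehouses combine the same ingredients in different ways: columnar formats, sorting or indexing, and vectorized execution.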
So how do you choose among these high-performance options for the various types of analytics? It’s an easier decision than you might think. You can read more about how to decide here. If you want to understand the differences between these options in more detail, we’ll cover that really soon.