Are you running your analytics workloads on PostgreSQL? Is it delivering the expected performance for your dashboards and customer-facing analytics? A consistent trend across PostgreSQL implementations is the challenge customers face in delivering fast, responsive analytics at scale. As one of the most widely used open-source databases, with broad community support and managed offerings from every major cloud provider, PostgreSQL (aka Postgres) is easy to spin up as a production-grade database. But is it truly the right engine for analytics?
As a row-based relational database, Postgres is designed for transactional systems: a normalized data model geared towards single-record queries and transactional consistency, combined with indexing strategies for fast data retrieval. Data access and updates are fast, and it is simply easy to adopt. That ease leads to additional use cases, rapid data growth, and eventually analytics for decision support. OLTP workloads align well with row-oriented databases because typical inserts and updates touch individual records. In contrast, analytical workloads typically scan a subset of columns from a given table, and thus lend themselves to columnar orientation, which can be further optimized by sorting, compressing, and indexing column values. As a result, developers find themselves at the crossroads of two different workload profiles, trying to determine which route to take. There are different approaches that can be used; however, as with most technical challenges, the right one depends on the specifics of each situation.
Mixing transactional and analytical workloads on Postgres
This approach can support the needs of the business, especially if the data volume and rate of change are low enough that the implications of mixing workloads do not matter. If there is a performance impact, Postgres read replicas can offload reporting away from the production application. Granted, you are doubling up on infrastructure with an additional copy of data and compute, but this provides isolation between the two workloads. However, the read replica retains the schema of the production instance, limiting any potential gains from offloading. Also, the freshness of data on the replica varies with the replication lag and the rate of change on production.
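As a sketch of how you would keep an eye on that freshness, stock PostgreSQL (version 10 and later) exposes replication lag through standard system views and functions; nothing below assumes a custom schema:

```sql
-- On the primary: one row per standby, with how far each lags behind
SELECT client_addr,
       state,
       write_lag,
       flush_lag,
       replay_lag
FROM pg_stat_replication;

-- On the replica: approximate staleness of the data being served
SELECT now() - pg_last_xact_replay_timestamp() AS replication_delay;
```

If the delay on the replica regularly exceeds what your reports can tolerate, the replica approach alone will not be enough.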
Materialized views are another option to reduce the pain of mixing workloads. Calculating aggregates during off-peak hours reduces the production impact of running reports and analytics, and can help meet reporting SLAs. However, PostgreSQL does not refresh materialized views automatically, so get used to scheduling manual refreshes and rewriting queries to make use of them.
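A minimal sketch of that workflow, using a hypothetical `daily_revenue` view over an `orders` table:

```sql
-- Pre-aggregate once, so dashboards hit the small view instead of the fact table
CREATE MATERIALIZED VIEW daily_revenue AS
SELECT order_date, SUM(amount) AS revenue
FROM orders
GROUP BY order_date;

-- A unique index is required for CONCURRENTLY refreshes
CREATE UNIQUE INDEX ON daily_revenue (order_date);

-- Must be re-run (e.g. from cron, during off-peak hours) whenever
-- fresh numbers are needed; CONCURRENTLY avoids blocking readers
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_revenue;
```

Note that reporting queries must also be rewritten to target `daily_revenue`; the Postgres planner will not route a query against `orders` to the view on its own.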
Offloading analytics workloads to a different Postgres instance
If you are looking at hundreds of GBs of data to report against, with increasing user counts, another option is to offload production data to a separate Postgres database or instance that is optimized for reporting. With this approach, you continue to leverage your existing skill set while optimizing your data model for analytics. Denormalizing the data model, reducing indexes, using parallel query, materialized views, and partitioning can all drive performance improvements for the analytics workload. You may even be able to add third-party Postgres extensions aimed at analytics workloads.
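As an illustration, declarative range partitioning and parallel query settings in stock PostgreSQL might look like the following (the `events` table and the worker count are assumptions for the sketch, not recommendations):

```sql
-- Range-partition a fact table by month so scans touch only relevant partitions
CREATE TABLE events (
    event_time  timestamptz NOT NULL,
    user_id     bigint,
    payload     jsonb
) PARTITION BY RANGE (event_time);

CREATE TABLE events_2024_01 PARTITION OF events
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

-- Allow more parallel workers for large scans
-- (session-level here; tune to the instance's core count)
SET max_parallel_workers_per_gather = 4;
```

Each of these knobs helps, but each is also something you now own and must keep tuned as data and query patterns change.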
The case for purpose-built analytics solutions
While the two options above provide relief, there are other fundamental considerations. This may be the time to pause and consider your overall data strategy. Are you thinking of building a data warehouse, and PostgreSQL happens to be the database you're most familiar with? Does this align with your business goals, and will this approach support the data volume as the business grows? Do you need to integrate other data sources to get a 360° view of your business?
From a pure tech-stack standpoint, will you hit limitations or operational challenges with the approaches mentioned above? For example, with a materialized view, will users know when to query the materialized view versus the original table? How frequently is it refreshed? How current is the data? How does the query optimizer leverage it? The same applies to parallel query tuning: is parallel query configured with the right number of workers? Is it suitable for the specific query? And so on.
In typical database performance tuning fashion, you need to pay attention to disk I/O. If you can avoid a disk read, the benefits are tremendous. However, this means you need to be aware of every query you send to the backend, which is not really a great way to scale. Another common challenge is sharing compute between ingest and analytics, especially large-scale ingest. The two workloads can conflict with each other, and customers need to look for ways to reduce the impact.
Another consideration with hyperscalers like AWS, GCP, or Azure is that the providers specify the performance profile of a PostgreSQL instance in terms of vCPUs, RAM, network bandwidth, IOPS, throughput, and so on. There are maximum capacity limits enforced by the providers, depending on the choice of instance. These numbers vary, and if you are collecting large volumes of data you are bound to hit capacity and performance limits for your PostgreSQL instance. Awareness of these limits is important given your scaling needs and budget.
Expanding on the above theme, something most folks do not think of is the I/O profile of the disk types offered by cloud providers. The I/O profile of storage, be it SSD or HDD, is typically fixed (for example, 3 IOPS/GB with a maximum of 10,000 IOPS; at that rate a 1 TB volume tops out at 3,000 IOPS no matter how demanding your queries are). Where the I/O profile is customizable, you can end up paying a significantly higher cost to use that capability. Striking the perfect balance between price and performance requires a lot of babysitting.
The point is, while the PostgreSQL approaches above can work, infrastructure restrictions limit how far the solution can scale. Instead of retrofitting your PostgreSQL instance for analytics, take a future-proof approach by running analytics workloads on a purpose-built solution like Firebolt.
What makes Firebolt a great analytics solution to replace PostgreSQL …
Firebolt was built from the ground up to deliver the fastest analytics experiences on a cloud-native architecture. As a SaaS offering with no servers to manage, bring your dataset and SQL queries and you are off to the races.
Firebolt’s decoupled compute and storage make it easy to scale compute independently. This allows isolation of workloads and spin-up/spin-down of environments, eliminates shard management, and improves automation. With parallel, vectorized processing of data in petabyte-scale object storage, Firebolt eliminates the infrastructure challenges described above. Additionally, you can isolate ingest and analytics, or any conflicting workloads, on separate engines while sharing a common storage platform. This means the common problem of ETL workloads impacting your customer-facing dashboards goes away.
Firebolt leverages a columnar storage format designed for efficient data access. By reading compressed data, the amount of data scanned is reduced to a fraction of the table size, requiring fewer hardware resources and improving performance. Want to drive this efficiency even higher? Use sparse indexes to further reduce the amount of data scanned, and response times improve further.
How about materialized views? In Firebolt you use aggregating indexes, which are auto-synced at ingest time and provide sub-second responses for your reporting and dashboard needs. No more manual materialized view refreshes.
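A sketch of what this looks like, using a hypothetical `orders` table (syntax per Firebolt's aggregating index DDL; check the current Firebolt docs for the exact dialect available to your account):

```sql
-- The index is defined once and maintained automatically as data is ingested
CREATE AGGREGATING INDEX agg_orders_daily
ON orders (
    order_date,
    SUM(amount),
    COUNT(*)
);

-- Queries whose grouping and aggregates match are served from the index;
-- no rewrite and no REFRESH step required
SELECT order_date, SUM(amount)
FROM orders
GROUP BY order_date;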
The beauty in all this is that you can do it all with SQL. Want to scale your data warehouse up, or down to zero to save costs? You can do that through SQL. Need to integrate with your favorite visualization tool, like Tableau or Looker? Firebolt has a growing ecosystem of integrations that you can leverage.
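For example, bringing compute online for the workday and shutting it down when idle is a plain SQL statement (the engine name is a placeholder; consult Firebolt's documentation for the engine options available to your account):

```sql
-- Scale up for the day's dashboards...
START ENGINE analytics_engine;

-- ...and down to zero when nobody is querying, so you stop paying for compute
STOP ENGINE analytics_engine;
```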
Why did a PostgreSQL customer choose Firebolt for analytics?
OmniPanel provides e-commerce insights for companies by integrating data from platforms like Shopify, Returnly, and Sendgrid. Their software connects the dots between customers, orders, shipping, returns, marketing, support, and more to supercharge customer insights, reporting, and cross-functional work. While OmniPanel was building out its technology stack, the team identified that a data layer on PostgreSQL would not scale to meet the needs of their customers.
After evaluating multiple analytics solutions, OmniPanel chose Firebolt. Firebolt’s performance was 15x faster than their existing Postgres database, with an average query time of less than 500 milliseconds. They were able to increase performance even further with a new data model and Firebolt’s aggregating indexes. Final improvement numbers for key OmniPanel queries are shown below.
Need to dive deeper? Firebolt solutions architects are here to help.