What is a data warehouse and cloud data warehouse?
First, let’s get the easy part out of the way: the definition.
A data warehouse (DW), sometimes positioned as an enterprise data warehouse (EDW), is a relational database that is optimized for online analytical processing (OLAP) and business intelligence (BI), not online transactional processing (OLTP). The data warehouse has been the most common database for analytics. Now the most common data warehouse deployed for any new analytics is the cloud data warehouse. A cloud data warehouse (CDW) is any data warehouse delivered as a service.
Data warehouses were created in part to offload analytics from traditional relational databases, because the various applications that ran on top of those databases were already overloaded. But the other, more technical reason is that OLTP databases - such as Db2, MySQL, Oracle, Postgres, and SQL Server - were not as good for OLAP because they were row-based. Columnar storage is much better for general-purpose analytics.
Why? Most analytics workloads need to pull a large set of rows from a much larger table, or tables, using predicates (conditions such as equality filters) on a few columns. Row-based databases are great for updating and deleting data fast, supporting transactions, and doing specific “one row” lookups. But they store a table as one row after another, sequentially. To evaluate the conditions, you first have to read each full row just to get at the column values you need.
Columnar databases make analytics faster by storing each column sequentially. That means you only need to read the relevant columns, not the entire row. You can also compress each column when its values repeat, which makes reads even faster. There are other, more advanced tricks, such as query vectorization. In short, columnar storage is much faster and more efficient for analytics, and columnar storage combined with advanced indexing is the fastest.
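To make the row-versus-column difference concrete, here is a toy Python sketch - purely illustrative, not any real engine's implementation - of the same table stored both ways, plus the run-length compression trick mentioned above. The table and its column names are hypothetical.

```python
# Table: orders(order_id, customer_id, region, amount)
# Row store: each record is stored as one complete row.
rows = [
    (1, "c1", "EU", 20.0),
    (2, "c2", "US", 35.0),
    (3, "c1", "EU", 10.0),
    (4, "c3", "EU", 50.0),
]

def total_eu_row_store(rows):
    total = 0.0
    for row in rows:              # must touch every full row to test the predicate
        if row[2] == "EU":        # region column buried inside each row
            total += row[3]
    return total

# Column store: each column is stored contiguously; a query reads only
# the predicate column (region) and the projected column (amount).
columns = {
    "order_id":    [1, 2, 3, 4],
    "customer_id": ["c1", "c2", "c1", "c3"],
    "region":      ["EU", "US", "EU", "EU"],
    "amount":      [20.0, 35.0, 10.0, 50.0],
}

def total_eu_column_store(cols):
    total = 0.0
    for region, amount in zip(cols["region"], cols["amount"]):
        if region == "EU":
            total += amount
    return total

# Repetitive columns compress well; run-length encoding a sorted region
# column turns ["EU", "EU", "EU", "US"] into [("EU", 3), ("US", 1)].
def run_length_encode(values):
    encoded = []
    for v in values:
        if encoded and encoded[-1][0] == v:
            encoded[-1] = (v, encoded[-1][1] + 1)
        else:
            encoded.append((v, 1))
    return encoded

print(total_eu_row_store(rows))                       # 80.0
print(total_eu_column_store(columns))                 # 80.0
print(run_length_encode(sorted(columns["region"])))   # [('EU', 3), ('US', 1)]
```

Both functions return the same answer, but the column-store version only touches two of the four columns - the gap grows with table width and row count, which is exactly why analytics engines favor columnar layouts.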
In other words, row-based databases are great for transactions, and column-based databases are great for answering questions. This is an important point: a data warehouse with columnar storage is not just for traditional analytics or OLAP. It is good for asking many other types of complicated questions to support operations or deliver new services, such as:
- Sophisticated “lookups” for support or operations, such as looking up all the purchases or actions of a customer in a time range, or all products or shipments traveling a particular route.
- Operational metrics for monitoring, such as tracking network performance (telemetry), application monitoring, customer service, supply chains, or machines and the Internet of Things (IoT)
- Customer-facing applications where customers need their own data. At some point, when the overall data set becomes too big or the queries become complex enough, row-based databases will become too slow.
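As a hedged sketch of the first kind of “lookup” above - all purchases of one customer in a time range - here is how that query reduces to predicates on just two columns of a column-store table. The table, column names, and data are all hypothetical.

```python
from datetime import date

# Hypothetical purchases table stored column-wise.
purchases = {
    "ts":       [date(2023, 1, 5), date(2023, 2, 1),
                 date(2023, 2, 20), date(2023, 3, 2)],
    "customer": ["c1", "c2", "c1", "c1"],
    "product":  ["p9", "p3", "p1", "p9"],
}

def purchases_in_range(cols, customer, start, end):
    # Only the two predicate columns (customer, ts) and the projected
    # column (product) are scanned; any other columns stay untouched.
    return [
        product
        for ts, cust, product in zip(cols["ts"], cols["customer"], cols["product"])
        if cust == customer and start <= ts <= end
    ]

print(purchases_in_range(purchases, "c1", date(2023, 2, 1), date(2023, 3, 31)))
# ['p1', 'p9']
```

A row store answering the same question would read every column of every row just to test the two predicates, which is why these lookups degrade first as data grows.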
Data warehouses vs. data lakes vs. federated query engines vs. analytics databases
A decade ago, data warehouses had a lot of challenges with speed, scale, and cost. In response, the analytics database market became very fragmented as different technologies emerged to solve slightly different problems.
- Data lakes: Hadoop came first. It provided scalability, lower costs, and support for other, more batch-oriented compute workloads, including ETL, which offloaded some compute (as ELT) and storage from data warehouses. Hadoop was both compute and storage.
- Data lakes: Eventually people started to separate compute and storage. Data lakes became storage-only layers designed to let different compute engines access the data.
- Federated query engines: The most popular way to access data lakes is with federated query engines. Hive, Presto, and Spark, which were initially added to Hadoop, evolved into query and compute engines on top of data lakes. Since then we’ve seen Amazon Athena and Redshift Spectrum, Dremio, and Databricks Delta Lake (which includes Photon) emerge. That’s a lot of different data lake options, but that’s the point: the data should be accessible by anyone using any tool to get the most out of it.
- Cloud data warehouses: Around 2012, BigQuery, Redshift, and Snowflake were released or announced as some of the earliest cloud data warehouses. Together they simplified deployment and management, and delivered elastic scalability with decoupled storage-compute architectures. Since then others, including Firebolt, have emerged. Unlike data lakes, while they are also built on decoupled storage and compute, ingestion and storage are closely coordinated with compute to improve query performance.
- Specialized analytics databases: Apache Druid was initially built by Metamarkets starting in 2011, and then moved to an Apache license in 2015. Initially built to power ad tech analytics, it evolved into a specialized analytics database for fast query performance. Then came Pinot and ClickHouse. Time series databases like InfluxDB and TimescaleDB, and a host of other specialized databases, emerged as well. You can think of them as purpose-built engines that support specific types of analytics and data applications. They are not as flexible for general-purpose analytics as cloud data warehouses, but they are generally faster. This has started to change as cloud data warehouses like Firebolt add similar techniques.
So what are the major differences between data warehouses, data lakes, federated query engines, and the various types of specialized analytics databases, and how should you choose between them for each type of project? After all, a data warehouse, a data lake with a federated query engine, and the various specialized analytics databases can all run queries.
In short, you should answer three questions:
- What are your performance requirements? Do you need a 1 second SLA or not?
- Do you know what data you need in advance?
- Do you know what queries you need to run in advance?
Based on your answers, here are your choices:
If you want to learn more about choosing between Druid, ClickHouse, and Pinot vs data lakes and data warehouses, you can read this blog. But please note: you will most likely adopt all of these technologies over time as part of your end-to-end modern data stack, because each serves a different purpose.
You will likely have a decentralized data pipeline that feeds into a data lake, and then either from the data lake or directly into data warehouses and databases, depending on the types of data. Your data pipeline is effectively your data. When implemented properly, it gives you the flexibility to add or change databases and data quickly.
Data warehouses vs. cloud data warehouses
There are two really big differences between on-premises and cloud data warehouses. The first, of course, is that cloud data warehouses run only in the cloud; at this point most on-premises data warehouses can also be deployed in the cloud. The second is that most cloud data warehouses support a decoupled storage-compute architecture.
With a traditional coupled storage-compute architecture, your compute cluster has to hold the entire data set and run all the queries. This is also the case with all specialized analytics databases.
Over time, several relational databases have added replication to scale out query loads across identical clusters and separate reads from writes. But these architectures are limited in scalability because each cluster must still hold all the data, and you cannot completely separate compute workloads.
Amazon Redshift is unique among cloud data warehouses in that it originally had a coupled storage-compute architecture, built using ParAccel technology. Redshift now has a hybrid architecture: you can execute federated queries via Amazon Redshift Spectrum (which is similar to Amazon Athena) and deploy newer RA3 nodes that cache only the data they need, not all the data. But you still cannot completely separate queries across different clusters.
In the decoupled storage-compute architecture used by Snowflake, BigQuery, and Firebolt, you can bring up any cluster with a subset of the data and distinctly different queries and users. You can also add or remove nodes for each cluster with a few clicks. This not only makes scaling easier and more cost-efficient; it also allows you to isolate workloads from each other and support a different service level agreement (SLA) for each workload.
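A minimal Python model of the decoupled architecture just described - purely illustrative, not any vendor's API - can show the two key properties: clusters cache only the subset of data they actually query, and each cluster scales independently of the others.

```python
# Hypothetical shared storage layer (think object storage holding all tables).
shared_storage = {
    "orders":  "orders data in object storage",
    "clicks":  "clicks data in object storage",
    "billing": "billing data in object storage",
}

class ComputeCluster:
    """Toy model of one independent compute cluster over shared storage."""

    def __init__(self, name, nodes):
        self.name = name
        self.nodes = nodes
        self.cache = {}                      # holds only the tables it has read

    def query(self, table):
        if table not in self.cache:          # fetch from shared storage once
            self.cache[table] = shared_storage[table]
        return f"{self.name} ({self.nodes} nodes) scanned {table}"

    def resize(self, nodes):
        self.nodes = nodes                   # scale this cluster alone

# Separate workloads run on separate clusters against the same storage,
# so a heavy ELT job cannot slow down the BI dashboards (workload isolation).
bi = ComputeCluster("bi", nodes=2)
elt = ComputeCluster("elt", nodes=8)

print(bi.query("orders"))
print(elt.query("clicks"))
elt.resize(16)                               # only the ELT cluster grows
print(sorted(bi.cache))                      # ['orders'] - a subset, not all data
```

Contrast this with the coupled model above, where one cluster would have to hold all three tables and absorb both workloads, forcing both SLAs to share the same resources.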
One last, and very important, point: do not assume modern cloud data warehouses are faster. On-premises data warehouses have been hardened for decades and can be very fast. Even Redshift can be faster than Snowflake or BigQuery. Make sure you run your own benchmarks on performance and price-performance to understand the differences.
Comparing cloud data warehouse vendors
Comparing cloud data warehouses can be confusing. Over the last decade, the number of options has exploded.
- Hybrid cloud data warehouses, which evolved from their on-premises roots such as Oracle, Teradata, Vertica, or Greenplum, built for BI teams doing reporting and dashboards
- 1st generation cloud data warehouses, such as Redshift, which was built on ParAccel, that helped “Shift Red” (Oracle) and other data warehouse deployments by porting data warehouse technology to the cloud and simplifying deployment
- 2nd generation data warehouses, such as Snowflake and BigQuery, which improved scalability by separating storage and compute, and simplified administration
- 3rd generation data warehouses, such as Firebolt, which improved performance and lowered costs by adding the latest innovations in data warehousing, more computing choices and control, and different pricing models
That’s not even including the federated query engines or specialized analytics databases.
Making the right choice partly requires a deeper understanding of how these data warehouses and query engines compare, based on features including:
- Architecture: elasticity, technology support, isolation, and security
- Scalability: elastic scale, scale efficiency, query and user concurrency
- Performance: query, indexing, data type and storage optimization
- Use cases: reporting, dashboards, ad hoc, operations and customer-facing analytics
- Cost: administration, infrastructure resources, vendor pricing
You also need to evaluate each cloud data warehouse based on the use cases you need to support, which include:
- Reporting by analysts against historical data
- Dashboards created by analysts against historical or live data
- Ad hoc analytics within dashboards or other tools for on-the-fly interactive analysis
- High performance analytics of very large or complex queries with massive data sets
- Big data analytics using semi-structured or unstructured data
- Data processing used as part of a data pipeline to deliver data downstream
- Machine or deep learning to train models against data in data lakes or warehouses
- Operational analytics needed by much larger groups of employees to help them make better, faster decisions on their own
- Customer-facing analytics delivered as (paid) self-service analytics to customers
For a detailed comparison that uses all these categories of features and use cases, you can read the cloud data warehouse comparison guide. You can also compare a few of the data warehouses side-by-side using a slightly more interactive cloud data warehouse comparison tool.
In short, here is what you will find:
- Redshift, while it is arguably the most mature and feature-rich, is also the most like a traditional data warehouse in its limitations. This makes it the hardest to manage, and not as well suited for the newer use cases or for separating workloads.
- Athena is arguably the easiest, least expensive and best suited for “one-off analytics”. But it is also the most limited, and requires you to manage your own (external) storage and ingestion very well, which is especially hard for continuous ingestion.
- BigQuery combines the best of Athena as a serverless data warehouse with the option to purchase reserved compute like Snowflake. While on-demand pricing is often too expensive for regular workloads, reserved and flex slots deliver pricing that is competitive with Redshift and Snowflake. But BigQuery comes with the limitations of a shared service as well, including extensive throttling and other limits designed to protect BigQuery customers from rogue queries. If you buy reserved or flex slots, remember that you can remove these limits. Its query performance is also similar to Redshift and Snowflake, and is too slow for interactive analytics at scale.
- Snowflake, as a more modern cloud data warehouse with decoupled storage and compute, is easier to manage for reporting and dashboards, and delivers strong user scalability. It also runs on multiple clouds. But like the others, Snowflake does not deliver sub-second performance for ad hoc, interactive analytics at any reasonable scale, or support continuous ingestion well. It can also be expensive to scale for large data sets with complex queries or semi-structured data.
- Firebolt is the only data warehouse with decoupled storage and compute that supports ad hoc and semi-structured data analytics with sub-second performance at scale. It also combines simplified administration with choice and control over node types, and greater price-performance that can deliver the lowest overall TCO. This makes it the best choice for ad hoc, high performance, operational, and customer-facing analytics with a 1-second SLA.
This guide is meant to help you with that journey. It is broken down into three sections, which you can read through in any order:
- Introductions and definitions of the different components of existing and modern data stacks
- How existing and modern data stacks work together
- A roadmap that architects and data engineers have used to layer modern data stacks on top of existing systems and pipelines
Remember the words on the cover of The Hitchhiker’s Guide to the Galaxy: Don’t Panic. Implementing your modern data stack, and choosing the right data warehouse(s), data lakes, or federated query engines, does not need to be a hard journey.