Making sense of a data lake, delta lake, lakehouse, data warehouse and more
Sometimes when I look at all the definitions of data lakes pushed by so many different vendors, I think of Eminem’s The Real Slim Shady and wonder: will the real data lake please stand up?
Part of the problem is that a data lake is like a hairdo; it’s constantly changing. Every few years I’ve had to go back and revisit my own definition of a data lake as we’ve learned more about what a data lake should be, and I’ve had to be “re-corrected” a few times.
This last round I asked James Dixon, who first defined it while he was at Pentaho. For the record, James and I worked together back in the late 90s, when he helped create a really good ad hoc analytics tool called Wired for OLAP. James and I were teenagers at the time, of course.
James’ original definition was:
“A data lake is a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data.”
I went back to James because I wanted to get his perspective on whether it was OK for a data lake to be coupled to a specific compute engine. For example, back when data lakes started to exist in some form, several companies implemented one using Hadoop. This meant you typically accessed the data in HDFS through Spark and/or Hive. I was OK accepting the reality of this coupling because that’s what customers did, and the customer is always right. Perhaps unsurprisingly, James has remained true to his original definition, and for a few great reasons.
My latest conclusion? I was wrong. A data lake should just be about storage. I realized it when I saw coupled storage and compute crippling some data pipelines and analytics architectures.
It’s really important for most companies to understand what a data lake should be, because many different vendors are selling different versions of “lakes” based on their role along the data pipeline. For example, if you are doing ETL, you need something like a “delta lake” to hold the raw data and intermediate data you create during Spark-based ETL and other data processing. If you want to combine your data lake and data warehouse, you might call it a “lakehouse.”
All of these definitions of a data lake that go beyond just “raw storage” cause problems.
The purpose of a data lake
After reliving the history, and the lessons learned, I came back to the original definition and goals of a data lake. It should be a storage area for raw data that makes any data readily available to anyone to use when they need it.
The data in a data lake should be:
- Raw: preserve as much detail as possible
- Secure: adhere to internal and regulatory data and security requirements
- Readily usable: be cleansed and consistent across data sources
- Easily accessible: support data engineers’ and analysts’ tools of choice
You don’t need to, and really you can’t, implement all of this at once. But if your data lake does not satisfy all these requirements, you should first ask yourself why, and then decide when you need to implement the missing parts.
- If you do not have all the raw data because you’re trying to lower costs, is it because your “data lake” is too expensive? Are you using a data warehouse as your lake?
- If it’s not secure, is this an issue with S3 security? Something else?
- If you haven’t implemented data quality, how does each group clean it up, and can it lead to inconsistent data across groups? What is the cost of raw dirty data? This is perhaps where I have the most debates, which could lead to a whole blog series. I usually point out two realities. If your data is dirty and people lose trust in the data, they won’t use it as much. And the best place to fix your data is as “upstream” as possible, ideally when it’s entered.
- Who can’t easily access the data, and why? For example, if the data lake is accessed via Spark, who builds the Spark jobs and how long does it take? What’s the backlog?
That may help explain what’s missing from your version of a data lake, and why it may need to change. The good news is that plenty of companies have gone through the same challenges.
The previous iterations of data lakes
Many companies have experienced several phases of data lakes so far:
- Using a data warehouse as a data lake, including modern cloud data warehouses
- Trying Hadoop (this is declining in use)
- Implementing a cloud data lakehouse that combines the data lake and warehouse
- Creating a modern (cloud) data lake that is just storage
Each phase came with its challenges.
The enterprise data warehouse (EDW)
It started with the enterprise data warehouse (EDW) and the goal of having one data warehouse with consistent data for everyone. This was never really possible. But the goal made us believe you should only have one data warehouse as the single source of truth, even though in reality we know from surveys that companies had up to 30 copies of any given data.
By 2010, data warehouses were starting to break down as the central place for all data:
- Data warehouses could not store or process the massive amounts of new “Big Data”. A larger data store and staging area was needed.
- BI users were (always) looking to get around the data warehouse because they needed to get at new data and build new reports faster.
- Non-BI users, like data scientists and data engineers, were trying to store and manage data for other needs beyond traditional analytics, from monitoring to data science and machine/deep learning.
About a decade ago, companies started to try using Hadoop as a staging area in front of the data warehouse, and to support ETL as well. Over time, several early Hadoop deployments evolved into data lakes. As Spark, Hive, and Presto matured, it became easier to access data in Hadoop deployments. As the excitement around Hadoop and Big Data continued, it was used for many different workloads, including analytics. But Hadoop was designed for batch, not for high-performance analytics. Hadoop also remained too complex for most companies to easily manage on their own. For these and other reasons, while companies have continued to use Spark, broader Hadoop adoption has slowed down.
Cloud data warehouse
At the same time, there was a huge push to the cloud. Around 2012 you saw Redshift, BigQuery, and Snowflake emerge. Companies started to migrate traditional reporting and dashboard use cases to the cloud as part of their larger cloud initiatives.
Around 2016, Snowflake started to market the idea of using the data warehouse as a data lake. After all, storage and compute are “decoupled” and storage costs the same as S3, so why not?
The big downside was that the only way to consume the data was through the data warehouse query engine. And if you’ve ever paid a bill for a cloud data warehouse, you know that compute can be reassuringly expensive. Many companies “throttle” their usage of the data warehouse to help prevent paying “too much”, which ends up making data less accessible.
Federated query engine
2012 was also the year when Facebook created and deployed Presto, in part to address the limitations of Hive. Then Facebook open-sourced Presto in 2013. In 2014, Netflix mentioned that they were using Presto with 10 petabytes of data. Over time Presto grew into the most popular open source federated query engine. The concept of a lakehouse using Presto or other federated query engines also became more popular.
I personally like federated query engines for exploring raw data and self-service analytics, and I do think they have a place in a data pipeline and analytics architecture (see my blog on choosing federated query engines and data warehouses for more). Presto is also a great engine for accessing a data lake.
But a lakehouse, a data warehouse, and a federated query engine all address different needs. They cannot truly replace each other. A federated query engine can never deliver high performance, interactive analytics at scale the way a data warehouse can because a federated query needs to fetch all the data over the network. That’s one of the reasons a data warehouse optimizes storage and compute together, to minimize network traffic. A data warehouse cannot replace a federated query engine either, unless it includes one.
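To put a rough number on the network argument, here is a back-of-envelope sketch in Python. The 1 TB scan size and the 10 Gbps link are purely illustrative assumptions, not measurements from any system, and the calculation ignores compression, column pruning, filter pushdown, and parallel links, all of which can reduce the real cost.

```python
# Back-of-envelope: how long does it take just to move the data for one scan?
# SCAN_BYTES and LINK_BITS_PER_SECOND are illustrative assumptions.

SCAN_BYTES = 1e12             # assume the query has to read 1 TB of raw files
LINK_BITS_PER_SECOND = 10e9   # assume a single 10 Gbps network link


def transfer_seconds(data_bytes: float, link_bits_per_second: float) -> float:
    """Lower bound on transfer time: bytes converted to bits, divided by link speed."""
    return data_bytes * 8 / link_bits_per_second


if __name__ == "__main__":
    seconds = transfer_seconds(SCAN_BYTES, LINK_BITS_PER_SECOND)
    print(f"{seconds:.0f} s just to move the data")  # prints "800 s just to move the data"
```

Under these assumptions the transfer alone takes more than 13 minutes, before any compute happens. Keeping the data already laid out and indexed next to the compute, as a warehouse does, is one way to avoid paying that on every query.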
Apache Spark, Databricks, and Delta Lake
Cloudera and Hortonworks, now merged as Cloudera, weren’t the only “Hadoop” vendors to target analytics and push terms like data lake or lakehouse. Databricks, which offers Spark as a service, also started to push the concept of a lakehouse with Delta Lake.
I would label Delta Lake as the most modern version of the Hadoop-based data lake.
Delta Lake was created to make sure you never lost data during ETL and other data processing, even if Spark jobs failed. While Delta Lake turned into more than just a staging area, it’s not a true data lake. Its name says it all; it’s a “delta lake”. It’s still mostly used to guarantee that all the “deltas” from Spark jobs are never lost. It helps guarantee that the end data loaded into a data warehouse is correct.
More importantly, the “lake” is still tightly coupled to Spark. You can import Parquet files into Delta Lake, but they are converted to a form of versioned Parquet (to guarantee transactional writes) that can only be accessed through Spark.
In other words, Delta Lake gives us one of the newer versions of Hadoop storage we were expecting might replace HDFS after Spark replaced MapReduce. But it’s still not quite as easily accessible as a file system.
Will the real data lake please stand up
What I’ve learned from all these iterations is that the best long-term architecture is to have a data lake that completely separates storage and compute. A data lake should allow access to any data via just about any engine. You can maintain compute engines with your data lake as well. Presto and Spark are great options. But they shouldn’t be the only options.
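As a toy illustration of what “storage only” means in practice, here is a minimal Python sketch using only the standard library. A temporary directory stands in for the lake, a CSV file stands in for raw data in an open format, and two unrelated “engines” (plain Python and SQLite) read the same file. The file name, column names, and function names are all hypothetical, and a real lake would more likely hold Parquet files on object storage such as S3.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


def write_raw_events(lake_dir: Path) -> Path:
    """Land raw events in the 'lake' as a plain file in an open format."""
    path = lake_dir / "events.csv"
    rows = [
        {"user_id": "u1", "amount": "10.5"},
        {"user_id": "u2", "amount": "4.0"},
        {"user_id": "u1", "amount": "2.5"},
    ]
    with path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["user_id", "amount"])
        writer.writeheader()
        writer.writerows(rows)
    return path


def totals_with_plain_python(path: Path) -> dict:
    """Engine 1: a script that reads the raw file directly."""
    totals = {}
    with path.open(newline="") as f:
        for row in csv.DictReader(f):
            totals[row["user_id"]] = totals.get(row["user_id"], 0.0) + float(row["amount"])
    return totals


def totals_with_sql(path: Path) -> dict:
    """Engine 2: a SQL engine that loads the same raw file and aggregates it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (user_id TEXT, amount REAL)")
    with path.open(newline="") as f:
        conn.executemany(
            "INSERT INTO events VALUES (?, ?)",
            [(r["user_id"], float(r["amount"])) for r in csv.DictReader(f)],
        )
    result = dict(conn.execute("SELECT user_id, SUM(amount) FROM events GROUP BY user_id"))
    conn.close()
    return result


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as lake:
        events = write_raw_events(Path(lake))
        # Both engines answer the same question from the same stored file.
        print(totals_with_plain_python(events))
        print(totals_with_sql(events))
```

The point of the sketch is that neither engine owns the data. Either one could be swapped out, or a third added, without rewriting or re-importing what is stored.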
Firebolt is actually such an engine. The majority of Firebolt deployments are implemented with a data lake as the source. The most common type of data lake we see on AWS is built on S3 as Parquet files, but JSON, Avro, ORC, and even CSV files are also used.
Firebolt is like Presto in that it can directly access and query external files in data lakes as external tables using 100% SQL. Data engineers can quickly build and deploy ELT by writing SQL and running it on any Firebolt engine (compute cluster). You can also access and load JSON using lambda array functions within SQL, and store JSON natively as a nested array structure to improve performance.
Because of this native support for ELT, most proofs of concept (POCs) and initial implementations have taken a few weeks or less. The subsequent projects have only taken hours or days.
Next steps in your data lake journey
Once you recognize where you are with your data lake(s) and what you need, start to plan ahead for the next stage in your data lake journey, because it is a journey.
A data lake will change over time, especially as your architecture matures and you add governance. I also went to Alex Gorelik, another data lake expert, since we also worked together as teenagers. When I asked for a definition, he gave me his book on the Enterprise Big Data Lake, which details what a fully governed data lake should be. Compare your version of a raw data lake to this future state. Yes, I have been reading it, and no, I do not get commissions.
Plan ahead. Don’t make your data lake like an impulse haircut and get locked into a vendor’s version of a data lake. Please stand up the real data lake, the one that’s right for you.