December 23, 2020

The data hitchhiker’s guide to cloud analytics


A brief history of analytics. Don’t Panic; history repeats itself

The answer is the application.

In the beginning, there was a data mess. It was by design, of course. We started with one application and its own data on the mainframe, then added the AS/400, VAX/VMS, the PC and three-tier, the Web, J2EE … you get the idea. Different groups built their own applications on different generations of technology to solve their own specific problems, without figuring out how the applications would fit together. Even application integration teams focused on just enough integration to get the next application deployed, not on keeping data consistent.

No wait. We need a report.

Then one day, an executive asked for a report. That executive realized every manager was giving a different answer because they all had their own versions of the same data. So he asked someone to clean it up. Of course, no application team would ever let anyone put an extra load on their infrastructure; the applications were already overloaded. So the cleanup had to be done independently.

Wait. The answer is ETL, and data marts … no wait … data warehouses … and spreadsheets.

This led to the concept of a single place to hold a single version of the truth for any analysis. And that required ETL.

  • Extract data from a source as a batch dump in the middle of the night when no one will be impacted
  • Transform data from all the different sources and try to piece it all back together
  • Load this new single version of the truth into a database for analysis.
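To make the mechanics concrete, here is a bare-bones Python sketch of that nightly ETL cycle, using only the standard library. The file name, table name, and cleanup rules are hypothetical, made up purely to illustrate the three steps above.

    # A minimal sketch of the classic nightly ETL cycle described above.
    import csv
    import sqlite3

    def extract(path):
        # Extract: read last night's batch dump from a source system.
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

    def transform(rows):
        # Transform: force the different sources into one consistent shape.
        cleaned = []
        for row in rows:
            cleaned.append({
                "customer_id": row["cust_id"].strip().upper(),
                "amount": round(float(row["amount"] or 0), 2),
            })
        return cleaned

    def load(rows, db_path="warehouse.db"):
        # Load: write the "single version of the truth" into the analysis database.
        con = sqlite3.connect(db_path)
        con.execute("CREATE TABLE IF NOT EXISTS sales (customer_id TEXT, amount REAL)")
        con.executemany("INSERT INTO sales VALUES (:customer_id, :amount)", rows)
        con.commit()
        con.close()

    if __name__ == "__main__":
        load(transform(extract("nightly_dump.csv")))

Run once a night, repeat forever. The stitched-together result is what the next paragraph is about.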

The result looked a little like Frankendata. It had been glued together and changed so much just to make it consistent that it no longer matched the original sources.

But it worked. Sort of. First came the creation of dimensions and facts, and data marts, in the sixties and seventies, with Bill Inmon seen as the father of the data warehouse. But Inmon insisted on storing data using entity-relationship (E/R) models, and that didn’t always work. So it led to a battle of the data gods between Inmon, who pushed for E/R in the warehouse and dimensions in data marts, and Ralph Kimball, who advocated dimensions everywhere in the form of star or snowflake schemas. No, not that Snowflake.

And yet, this wasn’t really the big battle. 1979 gave us a taste of what was to come. Not only was Teradata founded that year, helping drive the idea of the big, expensive data warehouse, but the biggest workaround of all, the spreadsheet, also arrived in the form of VisiCalc. Microsoft released Excel in 1985. Well done, Microsoft. You still have the #1 reporting tool on the planet, in part because we still haven’t gotten it right.

The data warehouse strikes back.

I’m allowed to mix stories.

The battle between data warehouses for reporting and data marts for ad hoc analysis continued, while others just ignored them and used spreadsheets. IRI and Essbase were great options for ad hoc. I still remember the year of the million-member dimension with Essbase. Data warehouses continued to grow and become a corporate standard as companies sought to build one source to rule them all (yes, I’m still mixing stories).

Data marts remained a domain-specific niche that got swallowed by the big Red database vendor. IRI was acquired by Oracle in 1995, and Arbor Essbase became Hyperion Essbase in 1998, which Oracle acquired in 2007. Today, data marts sound like a thing of the past, even though, as you will soon agree, we use them all the time.

The re-rise of BI and ETL

In the meantime, we saw a few generations of business intelligence (BI) tools. In the 90s, Cognos, Crystal Reports, Business Objects and others grew to cover reports, ad hoc, and everything in between, and were eventually acquired. MicroStrategy grew and remained independent. In the 2000s we saw the rise of Tableau, Qlik and Spotfire for more interactive dashboards. In the 2010s came Looker, Mode, and a host of newer tools that are taking dashboards and data science to the next level.

This move to more interactive, real-time analytics also brought change to ETL and data integration. Ascential, Informatica and Acta all drove data integration in the 1990s. Then came new entrants such as Talend, and Oracle and others entered the market as well.

Then came the cloud. Boomi entered, Informatica executed well, and so did MuleSoft. But the two really big stories were Big Data and the cloud.

Actually, the 1990s really did change everything. 

The Internet changed everything. It just took time.

Data exploded because of the Internet, and the new Internet companies that emerged, including Amazon in 1994. After a decade of evolution by these new players, they started to reinvent how applications and analytics should work. They started to teach the rest of the world how to use all this new data to deliver a much richer, more personalized, and more real-time experience to consumers. This was the next-generation movement towards real-time business.

I’m not saying it was all great.  Some weird things get created when you blindly reinvent the wheel. In general, when you get a new hammer, everything looks like a nail. The first hammer to become popular was Hadoop.

The answer is Hadoop. What’s the question?

In 1992, Walmart’s data warehouse hit 1 terabyte … by accident. Someone decided to use it as backup storage instead of tapes. Big data and data lakes were born. In the 1990s data started to explode. By 2010 the first data warehouses were reaching 1 petabyte and struggling to scale.

The first big open source innovation that was genuinely new for big data analytics, and not just the same old software turned into open source, was Hadoop.

Hadoop was designed as a batch-based, scale-out storage and compute engine for semi-structured data analytics. The two big vendors that pushed it initially were Cloudera and Hortonworks. Relatively quickly, Cloudera and Informatica worked together to help solve the data warehouse scalability challenges by creating a staging area in front of the data warehouse, pulling the “T” in ELT out of the warehouse and doing more of that work in Hadoop.
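As a rough illustration of that offload pattern, here is a deliberately anachronistic sketch using PySpark (the original pattern ran on MapReduce and Hive): raw extracts land in a staging area on the cluster, the heavy transform happens there, and only a slimmed-down result is handed to the warehouse. The paths and column names are hypothetical.

    # Sketch of the "staging area" offload: land raw dumps on the cluster,
    # run the heavy transform there, and ship only the summary to the warehouse.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("warehouse-offload").getOrCreate()

    # Raw nightly dumps from source systems land here untouched (the staging area).
    raw_orders = spark.read.json("hdfs:///staging/orders/2020-12-22/")

    # The "T" that used to run inside the warehouse now runs on the cluster:
    # cleanse, deduplicate, and aggregate down to what the warehouse actually needs.
    daily_revenue = (
        raw_orders
        .filter(F.col("status") == "completed")
        .dropDuplicates(["order_id"])
        .groupBy("order_date", "region")
        .agg(F.sum("amount").alias("revenue"))
    )

    # Only the summarized result is loaded into the data warehouse.
    daily_revenue.write.mode("overwrite").parquet("hdfs:///warehouse/daily_revenue/")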

Until Hadoop wasn’t just for batch processing. Suddenly people started to try to run analytics on Hadoop. They even tried ad hoc queries. We all somehow ignored that Hadoop was batch-centric. The answer to most problems became Big Data and Hadoop. Hadoop wasn’t just the wrong tool for ad hoc. It was also too complex for anyone but the best of the Internet companies.

Cloudera had the answer all along. They were originally called Cloudera because they intended to run Hadoop in the cloud as a service. Unfortunately all the data was on premises, so Cloudera quickly shifted to on-premises deployments to colocate Hadoop with the data.

Then, just as Big Data was caught in the trough of disillusionment, it simply disappeared from the hype cycle in 2015.

The answer is the cloud, the cloud data lake, cloud pipeline, and cloud data warehouse.

2012 was another magical year. Just as Hadoop was going after data warehouse offloading, analytics started to move to the cloud. BigQuery was released (OK, it was November 2011), Amazon released Redshift (OK, it was a preview in 2012; the full release was February 2013), and Snowflake was founded.

What happened to Hadoop? It has quietly moved back to where it belongs as a hammer, mostly. Spark came along and fixed some issues. More people also run Spark and Hadoop in the cloud now. The cloud data lake was born, which is basically a place where you store raw data. Presto (the foundation of Amazon Athena and Redshift Spectrum) became really popular.

By 2020, many companies moved reporting and dashboard workloads to cloud data warehouses. Looker and Tableau took off as two major choices for dashboards on top.

The answer on how to deliver real-time, personalized business

The big business question since the 1990s has really been “how do I implement a real-time, personalized business?” Fortunately, that question is just now starting to be answered.

It’s not existing ETL pulling data from existing applications. It’s not today’s cloud data warehouse. It’s not even today’s BI tools.

Don’t Panic.

The companies founded in the 90s have started to show the way. It involves a combination of a real-time data pipeline for big data that sits on top of, and is integrated with, a batch-based pipeline. This real-time pipeline has a host of processing engines along the way, including different analytics engines. One of those is a high-performance cloud data warehouse that can deliver fast analytics at the speed of thought, so that employees and customers can analyze data in real time. Firebolt is one example of this new type of data warehouse.
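To make the idea a little more tangible, here is a deliberately simplified Python sketch of the streaming half of that dual pipeline: events are consumed from a stream and micro-batched into the same warehouse tables the nightly batch path maintains. The Kafka topic, connection details, and the load_into_warehouse helper are hypothetical placeholders; a fast cloud warehouse such as Firebolt would sit behind the load step.

    # Simplified sketch of the real-time path: consume events, micro-batch,
    # and push them into the warehouse that the batch pipeline also feeds.
    import json
    from kafka import KafkaConsumer  # kafka-python

    def load_into_warehouse(rows):
        # Placeholder for a bulk insert into the cloud data warehouse.
        print(f"loading {len(rows)} rows")

    consumer = KafkaConsumer(
        "orders",
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )

    buffer = []
    for message in consumer:
        buffer.append(message.value)
        # Flush small micro-batches so dashboards stay seconds, not hours, behind.
        if len(buffer) >= 500:
            load_into_warehouse(buffer)
            buffer.clear()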

Welcome to the journey.



Intrigued? Want to read some more?