Rob's high performance data warehousing rule #2
September 19, 2023


There's no point in measuring anything if the data team can't measure itself.

We've all heard the statistics: data warehouse initiatives have a HUGE failure rate. They constantly miss functional expectations, budget expectations, timeline expectations, or worse, all of the above.

A big part of why this happens is that data teams aren't measuring themselves, so they can't communicate what's going on or adjust strategically.

One of the best tools for resolving this is simple logging. For every activity that occurs in the data warehouse, we need a log. When did the ingestion/summarization start? When did it end? What were the inserted and updated row counts? Did the process fail integrity checks? If we log these metrics for every process, even summarized by hour, we have the ability to actually manage both the warehouse and the team.
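The log record described above can be sketched as a simple data structure. This is a minimal illustration, not a prescribed schema; the field and process names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical minimal log record for one warehouse process run:
# start/end timestamps, row counts, and integrity-check outcome.
@dataclass
class ProcessLog:
    process_name: str
    started_at: datetime
    ended_at: datetime
    rows_inserted: int
    rows_updated: int
    integrity_check_passed: bool

    @property
    def duration_seconds(self) -> float:
        return (self.ended_at - self.started_at).total_seconds()

log = ProcessLog(
    process_name="orders_ingestion",  # hypothetical process name
    started_at=datetime(2023, 9, 19, 12, 0, tzinfo=timezone.utc),
    ended_at=datetime(2023, 9, 19, 12, 4, tzinfo=timezone.utc),
    rows_inserted=10_000,
    rows_updated=250,
    integrity_check_passed=True,
)
print(log.process_name, log.duration_seconds)  # orders_ingestion 240.0
```

Writing one such record per run (or per hourly batch) is enough to derive every metric discussed below.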

This will fulfill the requirements Malcolm Baldrige challenged us to meet:

On time: Did the activity occur within SLA? For hour X, how long did it take to finalize ingestion?

Backlog: Things sometimes go sideways and a process stalls. When this happens, how many batches stack up at any given time that need to be processed? If at noon today I'm three hours' worth of batches behind, we need to know now AND maintain a record of that.

Volume: For a given ingestion, how many rows were affected? We can't manage physical capacity unless we know volumetrics.

Rate: This is a derivative of volume: volume over time. We may see the rate increase for one period and decrease for another, which gives us visibility into things like resource contention.

Quality: How often are processes failing? For a given time period, we need to know the ratio of failures to successes.
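Most of these metrics fall out of simple arithmetic over the log records. Here is a hedged sketch of an hourly rollup computing on-time, volume, rate, and quality from per-run tuples; the SLA value and run data are made up, and backlog is omitted since it requires queue state rather than per-run logs.

```python
# Hypothetical per-run records for one hour: (duration_seconds, rows_affected, succeeded).
runs = [(240.0, 10_250, True), (310.0, 9_800, True), (95.0, 4_100, False)]
SLA_SECONDS = 300  # assumed SLA for a single run

on_time = sum(1 for d, _, _ in runs if d <= SLA_SECONDS)  # runs within SLA
volume = sum(rows for _, rows, _ in runs)                 # total rows affected
rate = volume / sum(d for d, _, _ in runs)                # rows per second
failures = sum(1 for *_, ok in runs if not ok)
quality = failures / (len(runs) - failures)               # failures-to-successes ratio

print(on_time, volume, round(rate, 1), round(quality, 2))  # 2 24150 37.4 0.5
```

Rolling these numbers up by subject area and by hour/day/month is what makes the trend comparisons below possible.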

Once we have these key metrics on every process by subject matter by hour/day/month, etc., we can do wonderful things like compare today's performance against a standard deviation over the past three months. This is where high performance happens. Alerting on any anomaly in these metrics that breaks half a standard deviation from the norm will drive the data warehouse team to improve. Constantly.
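The standard-deviation comparison above can be sketched in a few lines. This assumes a trailing window of historical values for one metric; the numbers are illustrative only.

```python
from statistics import mean, stdev

# Hypothetical trailing history for one metric (e.g. daily rows ingested, in thousands).
history = [100.0, 104.0, 98.0, 102.0, 96.0, 101.0, 99.0]
today = 110.0

mu, sigma = mean(history), stdev(history)
# Flag any value more than half a standard deviation from the historical mean.
is_anomaly = abs(today - mu) > 0.5 * sigma
print(is_anomaly)  # True
```

In practice the history window would span the past three months of rollups, and the alert would fire per process and per metric.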

The often more important thing though, is these metrics become the source of truth for how the warehouse team is doing. When management wants to know the value you bring, it's a simple report.
