We have all experienced data quality challenges in one form or another: incomplete or missing data, metrics reported differently across dashboards, processing errors, stale data, the downstream impact of schema changes, and so on. How do you take on data quality challenges without creating a mountain of extra work? Can dbt help? If you have transformations to build, dependencies to manage, and documentation to write, dbt is a great place to start. First, dbt provides a building-block approach to constructing your data models and implementing them in a platform-agnostic manner. Second, it makes it easy to incorporate data quality tests. Finally, it provides lineage and documentation; all roads paving the way to better data quality.
Implementing data quality tests with dbt
dbt allows you to incorporate data quality tests within your data transformation process. These tests can check for uniqueness, not-null conditions, accepted values, and relationships. Let's look at an example where we test the product and transaction tables in an e-commerce dataset:
A sample set of dbt tests can be implemented in YAML as shown below. It ensures that product_id in the product dimension table is unique and not null, and that product_category is one of a specified set of values. Finally, customer_id in the transactions table must exist in the customer dimension table. This set of specifications allows the developer to validate data as it arrives.
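A minimal sketch of such a schema file, assuming illustrative model names (dim_product, fct_transactions, dim_customer) and category values:

```yaml
# models/schema.yml -- model names and accepted values are illustrative
version: 2

models:
  - name: dim_product
    columns:
      - name: product_id
        tests:
          - unique
          - not_null
      - name: product_category
        tests:
          - accepted_values:
              values: ['electronics', 'apparel', 'home']

  - name: fct_transactions
    columns:
      - name: customer_id
        tests:
          - relationships:
              to: ref('dim_customer')
              field: customer_id
```

Running `dbt test` compiles each of these declarations into a SQL query that returns failing rows, so a non-empty result flags a data quality issue.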
Custom tests in dbt
dbt allows you to create custom tests to validate specific business rules or identify issues, like duplicate records. For instance, the following SQL query can be used to identify duplicate entries in a table:
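One way to express this as a dbt singular test: a SQL file in the tests/ directory that returns one row per duplicate key (a non-empty result fails the test). The model and column names below are illustrative:

```sql
-- tests/duplicate_products.sql
-- Singular test: any rows returned indicate duplicate product_id values.
select
    product_id,
    count(*) as occurrences
from {{ ref('dim_product') }}
group by product_id
having count(*) > 1
```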
This SQL statement can be implemented as a custom test within dbt, allowing you to quickly identify and address any duplicate records in your data.
Using dbt extensions for enhanced data quality checks
dbt supports extensions in the form of packages, which can be used to extend the set of pre-built tests available. One such package is 'dbt-expectations,' inspired by the open-source 'Great Expectations' tool. Using dbt-expectations, you can add a variety of new tests to your data models, such as checking for data distribution, row counts, and more.
To use dbt-expectations, first, add the package to your packages.yml file:
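For example (the version range below is illustrative; pin one compatible with your dbt version):

```yaml
# packages.yml
packages:
  - package: calogica/dbt_expectations
    version: [">=0.10.0", "<0.11.0"]
```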
Then, run dbt deps to install the package. Once installed, you can use the additional tests provided by the package in your data models. For example, to check if the number of rows in a table falls within an expected range, you can use the row_count test:
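A sketch of such a row-count check, using the package's expect_table_row_count_to_be_between test (the model name and bounds are illustrative):

```yaml
# models/schema.yml
version: 2

models:
  - name: fct_transactions
    tests:
      - dbt_expectations.expect_table_row_count_to_be_between:
          min_value: 1000
          max_value: 100000
```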
These dbt examples demonstrate how you can build data quality checks directly into your data transformation processes, ensuring that data issues are identified and addressed promptly. By incorporating tests for standard data quality checks and custom business rules, and by leveraging packages for enhanced testing capabilities, you can significantly improve the quality and trustworthiness of your data.
How Firebolt helps …
Unintuitively, data platform performance and data quality are interrelated. If it takes a platform five minutes to ingest an hour's worth of data but ten minutes to verify the correctness of that data, data teams are less likely to invest in data quality work: it costs too much in both time and platform spend.
Firebolt's native indexing accelerates queries far beyond the capabilities of other cloud data warehouse platforms, allowing data quality queries to complete at very low cost. Leveraging Firebolt's primary, join, and aggregating indexes not only saves money but also reduces validation time, enabling highly efficient data quality verification. Think of an aggregating index that simplifies aggregating the number of units sold to validate a business rule. With Firebolt, an aggregating index eliminates the need to recalculate the aggregate every time the test is run.
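As a sketch, assuming a hypothetical fct_transactions table with product_id and units_sold columns, such an index could look like:

```sql
-- Illustrative: precompute units sold per product so validation queries
-- read from the maintained index instead of rescanning the fact table.
CREATE AGGREGATING INDEX idx_units_sold
ON fct_transactions (
    product_id,
    SUM(units_sold)
);
```

Once the index exists, a validation query that groups by product_id and sums units_sold is served from the precomputed aggregate automatically.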
Not only does this get the quality check done, but it is extremely fast as well.
Extreme query performance = extreme data quality.