Data Warehousing Fundamentals

Before delving into the intricacies of cloud data warehousing, it’s important to get acquainted with the key data warehousing terms and concepts that will help you navigate this world. 

An enterprise data warehouse (EDW) is typically defined by four key characteristics:

  • Subject orientation: An enterprise data warehouse encompasses all business operations and structures them into subjects such as sales data, supply chains, customers, and products. 
  • Time variance: Time variance means that data is timestamped so it can be easily identified and managed by period. For instance, a user can retrieve data from a three-month, six-month, or one-year time period if required. Because records are stored along a time dimension, data isn’t modified once it’s in the warehouse. 
  • Non-volatility: Non-volatility means that an enterprise data warehouse only adds data; rather than erasing or altering existing records, changes are captured as new timestamped entries. This is critical for any business because it allows analysts to examine historical data at any given point, discovering past patterns or viewing current ones through a historical lens. 
  • Integration: A business needs standardized and contextualized data. A key component of enterprise data warehouses is the integration of data from highly variable formats and sources to assist with effective and efficient analysis. 

Structured Query Language (SQL) is a programming language, much like Python, JavaScript, or C++. The main difference between SQL and other programming languages is its application: SQL primarily serves databases and is universally accepted as the standard language for them. All actions within a database can be articulated and prompted using SQL. 

Data modeling is a way for companies to contextualize, connect, and create a conceptual or logical visualization of their data. In simple terms, it helps you make sense of your data. It creates a standardized environment that helps diverse stakeholders within an organization access and leverage data simply and effectively. 

Data engineering is a series of practices, processes, and protocols that help organizations efficiently take in vast amounts of raw data from different sources and use it for various purposes. Data engineering is about constructing the most optimal data pipeline, one that isn’t generic and meets the specific requirements of a particular enterprise. A data pipeline refers to a set of connected processes that covers everything from raw data ingestion and integration to transformation, storage, and assetization for delivery and end-user application.

SQL Fundamentals for Data Warehousing

As mentioned, SQL is a programming language specific to databases. The foundations of SQL were developed by IBM in the 1970s. From 1979 into the early 1980s, SQL became publicly available. Since then, it has become a standardized language, which means it’s highly portable and easy to apply across various databases.

SQL’s key capabilities begin with Data Definition Language (DDL) and Data Manipulation Language (DML). DDL is built around critical commands that include Create, Alter, Drop, and Truncate. DML is built around commands like Insert, Update, Delete, and Merge. 

As noted above, data modeling offers companies a standardized environment from which key users and stakeholders can access and utilize data. DDL is used to create a data model, a highly contextualized and streamlined data environment. DML is used to integrate data into that model. SQL is the language that implements a model and enables a smooth flow of data into it. Together, DDL and DML comprise the main building blocks of any database. 
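
For illustration, here is a minimal sketch of how DDL and DML commands work together. The table and column names are hypothetical, and exact data types and syntax vary by database engine.

```sql
-- DDL: define (and later evolve) the structure of a simple sales table
CREATE TABLE sales (
    sale_id    INTEGER PRIMARY KEY,
    product_id INTEGER NOT NULL,
    sale_date  DATE    NOT NULL,
    amount     DECIMAL(10, 2)
);

ALTER TABLE sales ADD COLUMN region VARCHAR(50);

-- DML: move data into and around the model
INSERT INTO sales (sale_id, product_id, sale_date, amount, region)
VALUES (1, 101, DATE '2024-01-15', 250.00, 'EMEA');

UPDATE sales SET amount = 275.00 WHERE sale_id = 1;

DELETE FROM sales WHERE sale_id = 1;
```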

Data Control Language (DCL) helps manage permissions and access to data. Common DCL language commands include Grant and Revoke. Data Query Language (DQL), a subset of DML whose main command is Select, is used to locate and retrieve data within a database. Transaction Control Language (TCL) is utilized for transaction management within a database; key TCL commands include Commit, Rollback, and Savepoint. 
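
The following sketch shows typical DCL, DQL, and TCL commands against the hypothetical sales table above; the role name is illustrative, and transaction syntax differs slightly between engines.

```sql
-- DCL: manage permissions and access
GRANT SELECT ON sales TO reporting_role;
REVOKE SELECT ON sales FROM reporting_role;

-- DQL: locate and retrieve data
SELECT sale_id, amount
FROM sales
WHERE sale_date >= DATE '2024-01-01';

-- TCL: control transaction boundaries
START TRANSACTION;
UPDATE sales SET amount = 300.00 WHERE sale_id = 2;
SAVEPOINT after_update;
ROLLBACK TO SAVEPOINT after_update;
COMMIT;
```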

Select statements are perhaps the most frequently used SQL commands. They allow users to access a dataset or database and select specific data from within it. Join clauses are an extension of the Select command; they select data from different tables and connect it according to a defined data model, as the sketch below illustrates. The main types of Join clauses are Inner Join, Left (Outer) Join, Right (Outer) Join, Full (Outer) Join, Cross Join, and Self Join.
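
As an illustrative sketch, assuming hypothetical orders and customers tables, the most common join types look like this:

```sql
-- Inner join: only orders with a matching customer
SELECT o.order_id, c.customer_name
FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id;

-- Left (outer) join: all orders, with customer details where available
SELECT o.order_id, c.customer_name
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id;

-- Full (outer) join: all rows from both tables, matched where possible
SELECT o.order_id, c.customer_name
FROM orders o
FULL OUTER JOIN customers c ON o.customer_id = c.customer_id;
```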

Data transformation takes place to satisfy particular business needs. Some of those needs are complex, and others relatively simple. For example, people analytics is a way for businesses to study and optimize the performance of their employees and teams. To accomplish this, specific employee data needs to be pulled from the system and amalgamated. Employee data could include names, demographic information, salary details, contact information, and job titles. This kind of data transformation can be done ad hoc, during data loading, or during specific queries with a Select statement.

Simple SQL functions, namely aggregate and scalar functions, can collect and transform this data into an optimized form that suits specific requirements. 

Aggregate functions calculate values across many rows of a database. An easy way to understand aggregate functions is to visualize a common spreadsheet application and the kind of mathematical calculations that can be performed in it. Scalar functions, by contrast, operate on individual values: they take one or more parameters and return a single output value.
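
To make the distinction concrete, here is a minimal sketch against a hypothetical employees table: COUNT, AVG, and SUM are aggregate functions, while UPPER and ROUND are scalar functions.

```sql
-- Aggregate functions: one result per group of rows
SELECT
    department,
    COUNT(*)    AS headcount,
    AVG(salary) AS avg_salary,
    SUM(salary) AS total_payroll
FROM employees
GROUP BY department;

-- Scalar functions: one result per individual row
SELECT
    UPPER(last_name)      AS last_name_upper,
    ROUND(salary / 12, 2) AS monthly_salary
FROM employees;
```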

Introduction to Data Modeling

Dimensional data modeling identifies the most important processes within an organization and prioritizes data modeling for those processes before addressing anything else. Dimensional modeling was introduced to the world in 1996 by Ralph Kimball in his book The Data Warehouse Toolkit.

By prioritizing only the most important aspects of a business’s operations, dimensional modeling makes things simpler for organizations. It’s a relatively lightweight model that enhances speed and efficiency. The key to successful dimensional modeling lies in choosing the right processes to mine data from and then stripping away all irrelevant, redundant, and low-quality data from large primary data sources. 

To successfully develop a dimensional model, the logical first step is to identify the right processes, dimensions (collections of data attributes based on descriptive commonalities), numerical measures, and the relevant properties or characteristics of each dimension. If the framework for a dimensional model is optimally designed, the model itself will be easy to use, and critical information can be retrieved from the data via simple queries. 

A few data modeling techniques stand above the rest in terms of popularity and effectiveness. Right on top of that list is star schema, which is easily the simplest model available. Star schema only contains quantitative facts and descriptive dimensions. 

Like the star schema, the snowflake schema works with facts and dimensions. However, it breaks the dimensions down into multiple tables to avoid data redundancy, a process called “normalization,” which makes it significantly more complicated than the star schema’s basic database design. 

Anchor modeling comprises four key components: anchors, attributes, ties, and knots, which represent entities, attributes, relationships between entities, and dependencies between attributes, respectively. An anchor model presupposes that the source of data is in constant flux and therefore inherently accommodates change. A data vault is a similar model comprising hubs, links, and satellites, representing entities, relationships, and attributes. It’s well suited to analyzing historical data and embracing parallel computing, and horizontal scalability can be achieved with minimal disruption by choosing data vault models. 

Temporality can be integrated into dimensional and anchor modeling through temporal relations. There are three main types: uni-temporal relationships have a single time aspect associated with each record, often referred to as "valid time"; bi-temporal relationships have two time aspects, such as "valid time" and "transaction time"; and tri-temporal relationships track an additional time aspect beyond valid time and transaction time. 

Anchor modeling offers solutions to work with multi-temporal structures. Temporality is a critical feature to include in data models. Time attributes can be complex to deal with, but companies will significantly benefit from the option of making temporal-based queries and decisions. 
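
As a small, hedged sketch of what bi-temporality can look like in practice, the table below tracks both valid time and transaction time for a customer address; the table and column names are illustrative.

```sql
CREATE TABLE customer_address_history (
    customer_id   INTEGER      NOT NULL,
    address       VARCHAR(200) NOT NULL,
    valid_from    DATE         NOT NULL,  -- valid time: when the address became true in reality
    valid_to      DATE,                   -- NULL while still valid
    recorded_at   TIMESTAMP    NOT NULL,  -- transaction time: when the warehouse learned of it
    superseded_at TIMESTAMP               -- NULL until a correction is recorded
);
```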

The simplicity, efficiency, and scalability of the star and snowflake schemas make them leading choices for data warehousing and BI applications. Let’s take a closer look at both. 

Star Schema

One of the most straightforward data warehousing models is the star schema. Its distinct design consists of a single central fact table joined to a handful of dimension tables; when, say, five dimension tables are joined to the central table, the layout resembles a star. It’s perfect for short, simple queries and straightforward reporting. However, one drawback of the star schema is that, due to its simplicity, it’s typically less efficient at, and at times incapable of, dealing with complex queries. 

The star schema helps data teams build denormalized dimension tables. This data structuring strategy organizes tables in a manner that deliberately introduces redundancy to boost query performance. 

A star schema for sales data with the centralized fact table linked to surrounding dimension tables

The example above is a star schema commonly used in e-commerce processes. Its clean design makes it easy to view and interpret past transactional interactions.

To create this schema for e-commerce transactions, each dimension table is defined first and then referenced from the central fact table. 
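
The following is a minimal DDL sketch of such a star schema; the table and column names are illustrative rather than part of the original example.

```sql
CREATE TABLE dim_customer (
    customer_key  INTEGER PRIMARY KEY,
    customer_name VARCHAR(100),
    city          VARCHAR(100),
    country       VARCHAR(100)
);

CREATE TABLE dim_product (
    product_key   INTEGER PRIMARY KEY,
    product_name  VARCHAR(100),
    category      VARCHAR(100)
);

CREATE TABLE dim_date (
    date_key      INTEGER PRIMARY KEY,  -- e.g. 20240115
    full_date     DATE,
    month         INTEGER,
    year          INTEGER
);

-- Central fact table: one row per sale, linked to each dimension by a foreign key
CREATE TABLE fact_sales (
    sale_id       INTEGER PRIMARY KEY,
    customer_key  INTEGER REFERENCES dim_customer (customer_key),
    product_key   INTEGER REFERENCES dim_product (product_key),
    date_key      INTEGER REFERENCES dim_date (date_key),
    quantity      INTEGER,
    sale_amount   DECIMAL(10, 2)
);
```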

Snowflake Schema

This data warehouse schema comprises a logical arrangement of dimension tables whose layout resembles a snowflake. The snowflake schema extends the star schema framework by incorporating additional sub-dimension tables that are connected to the primary dimension tables, which are in turn linked to the fact table. 

Just as a fact table holds foreign keys referencing the primary keys of its dimension tables, a dimension table in the snowflake schema holds a foreign key referencing the primary key of its sub-dimension table.

Snowflake schemas also create normalized dimension tables to structure and organize tables and decrease redundancy.

A snowflake schema illustrating its complexity and hierarchical structure

To create this snowflake schema, the dimension tables are normalized into additional sub-dimension tables. 
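
As a minimal sketch, assuming the star schema tables from the previous example, the product dimension can be split out into a category sub-dimension; the names here are illustrative.

```sql
-- Sub-dimension table holding the normalized category attribute
CREATE TABLE dim_category (
    category_key  INTEGER PRIMARY KEY,
    category_name VARCHAR(100)
);

-- The product dimension now references the sub-dimension instead of storing
-- the category text directly (replacing the denormalized dim_product from the
-- star schema sketch); the fact table itself is unchanged.
CREATE TABLE dim_product (
    product_key   INTEGER PRIMARY KEY,
    product_name  VARCHAR(100),
    category_key  INTEGER REFERENCES dim_category (category_key)
);
```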

Slowly Changing Dimensions (SCD)

As established above, the secret to robust dimension models is choosing the right processes with which to mine data. SCDs are a specific type of implementation for dimension models. They’re essential because they preserve the history of dimensional data and allow businesses to analyze updates over a period of time. SCDs are a major reason why enterprise data warehouses are non-volatile. 

Data attributes change with time. For example, customers’ personal data, such as addresses and contact information, is rarely going to remain the same. SCDs are dimensions that house data attributes that could change over time, but not frequently or at a rapid pace. 

SCDs are particularly critical within cloud data warehousing because the most valuable databases are those that can compare new data with older data to generate insights. In cloud data warehouses, the ability to accommodate gradual changes to data attributes is very important. 

There are many types of SCDs, each with its unique purpose. Type 1 SCDs are for those cases where historical data is permanently overwritten with new inputs. Overwriting or deleting historical data is a useful option if a particular set of data is especially redundant or misleading. Type 2 SCDs house both historical data and new data in the same table; in this case, a record can be stamped with a version number for context. 

A Type 3 SCD involves the addition of an extra column so that old data attributes and newer ones are both visible, typically in the same row but in separate columns. Type 4 separates old data and new data into different tables, with the table housing old data serving as an archive.
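
As a hedged sketch of the Type 2 approach, the dimension table below keeps every version of a customer’s address; the column names and update pattern are illustrative.

```sql
CREATE TABLE dim_customer_scd2 (
    customer_key INTEGER PRIMARY KEY,  -- surrogate key, one per version
    customer_id  INTEGER NOT NULL,     -- natural/business key
    address      VARCHAR(200),
    version      INTEGER NOT NULL,
    valid_from   DATE    NOT NULL,
    valid_to     DATE,                 -- NULL for the current version
    is_current   BOOLEAN NOT NULL
);

-- When a customer's address changes: close out the current row...
UPDATE dim_customer_scd2
SET valid_to = DATE '2024-06-01', is_current = FALSE
WHERE customer_id = 42 AND is_current = TRUE;

-- ...and add the new version as a fresh row, preserving history.
INSERT INTO dim_customer_scd2
    (customer_key, customer_id, address, version, valid_from, valid_to, is_current)
VALUES
    (1002, 42, '221B Baker Street', 2, DATE '2024-06-01', NULL, TRUE);
```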

