Data Engineering

Understanding ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform)

Data integration is fundamentally about moving data from diverse sources into a data warehouse (in this case, a cloud data warehouse). ETL and ELT are two ways of doing this, and despite their similar names, they represent significantly different approaches.

ETL involves gathering data from various sources, cleaning and formatting it to fit the destination system, and then loading the transformed data into that destination, typically a data warehouse or data lake. ETL helps enterprises consolidate data for more effective business intelligence and data analysis.

ELT involves extracting data from a variety of sources, loading the raw data into a destination system, and then transforming it within that system for analysis. ELT leverages the processing power of modern data warehouses for rapid data integration and transformation, which makes it highly suitable for big data scenarios.
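
To make the difference concrete, here is a minimal, self-contained Python sketch of the two orderings; the in-memory "warehouse" dictionaries and the cleanup rule are illustrative placeholders rather than a real integration.

```python
# Minimal sketch: the same steps, ordered differently.
# The in-memory "warehouse" and the cleanup rule are illustrative placeholders.

source_rows = [
    {"id": 1, "email": "A@EXAMPLE.COM"},
    {"id": 2, "email": None},            # dirty record
]

def extract():
    return list(source_rows)

def transform(rows):
    # Clean and standardize, either before loading (ETL) or after loading (ELT).
    return [{**r, "email": r["email"].lower()} for r in rows if r["email"]]

def etl(warehouse):
    rows = extract()
    rows = transform(rows)         # transform on a separate processing tier
    warehouse["curated"] = rows    # only curated data lands in the warehouse

def elt(warehouse):
    rows = extract()
    warehouse["raw"] = rows                              # land raw data first
    warehouse["curated"] = transform(warehouse["raw"])   # transform inside the warehouse

etl_wh, elt_wh = {}, {}
etl(etl_wh)
elt(elt_wh)
print(etl_wh)   # only the curated copy
print(elt_wh)   # raw plus curated copies
```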

ETL can be resource-intensive, but it provides some key benefits. Since ETL is an older, well-established model, more frameworks and tools are available for it. ETL also offers the ability to exclude sensitive data before loading it into the destination system, making it a stronger option for staying compliant with regulatory frameworks like CCPA, PCI DSS, GDPR, and HIPAA.

ELT works better with massive volumes of data. It is typically a cloud-based solution that offers modern advantages like enhanced speed at every stage of the process and easier upkeep. However, since ELT doesn’t offer the option of excluding data before loading it into the destination system, it is more susceptible to compliance failures.

Examples of open-source ETL/ELT tools:

Apache NiFi: Apache NiFi is an open-source data integration tool that enables the automation of data flows between disparate systems. It is designed to scale out in a clustered configuration, allowing for high-volume data flow. NiFi's intuitive interface allows for easy design, control, feedback, and monitoring of data pipelines.

Talend: Talend is a comprehensive open-source ETL tool that offers data integration, data management, enterprise application integration, data quality, and big data software solutions. It is Java-based, which makes it platform-independent.

Pentaho: Pentaho is a data integration platform that provides ETL capabilities. It's open-source, and its strong point is the broad toolset for a variety of tasks, such as reporting, data mining, and visualization. It also includes machine learning capabilities.

Examples of commercial and cloud-native ETL/ELT tools:

Fivetran: Fivetran is a cloud-based, fully managed ELT tool. It provides connectors to pull data from various sources and load it into a data warehouse for analytics. It shines in its simplicity and robustness, minimizing the need for data pipeline maintenance.

Informatica PowerCenter: Informatica PowerCenter is a widely used ETL tool that supports the entire data integration lifecycle, from jumpstarting a first project to scaling enterprise-wide. It offers high performance, scalability, and flexibility to drive better business decisions and more effective engagements.

Cloud-native offerings like AWS Glue, Azure Data Factory, and Google Cloud Data Fusion are additional examples. These offerings provide managed services in the cloud that are billed based on consumption. ETL/ELT pricing has been the subject of many blogs, as licensing and provisioning models vary, and assessing each offering's price-performance suitability requires significant due diligence.

Data Ingestion

Data ingestion is the first process in a data pipeline and one of its most integral components. A data pipeline is a series of processes that gathers raw data from diverse sources and transports it to a unified environment, such as a cloud data warehouse. Data freshness and latency play a crucial role in data ingestion, determining how quickly the most recent data becomes available for processing and analysis. Balanced against cost-effectiveness and simplicity, they enable rapid, high-quality analysis.

Different kinds of data ingestion processes serve specific needs. 

Real-time ingestion involves a constant, steady stream of data coming into a cloud data warehouse, where it is processed and dispatched wherever it is needed. Optimal use cases for real-time ingestion include stock market analysis, customer behavior tracking, IT infrastructure reporting, and predictive analytics.

Batch ingestion brings huge volumes of data into a location but processes it, for various reasons, at a later stage. It suits scenarios where organizations don’t require or want an immediate output, for example, employee payroll or invoice collection.

Change data capture (CDC) is a process for real-time or near-real-time ingestion in which only the data that has changed in a source database is identified and sent on for processing. It's an effective way to bypass the delays of batch ingestion, and it can help companies optimize costs and make quicker business decisions.
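
As a simplified illustration, the sketch below captures changes with a high-watermark timestamp query; production CDC tools usually read the database transaction log instead, and the table, columns, and timestamps here are illustrative.

```python
# Simplified high-watermark sketch of change capture; real CDC tools usually
# read the database transaction log. Table and column names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, updated_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10.0, "2024-01-01T00:00:00"), (2, 25.5, "2024-01-02T09:30:00")],
)

last_sync = "2024-01-01T12:00:00"   # watermark saved from the previous run

# Pull only the rows that changed since the last sync.
changed = conn.execute(
    "SELECT id, amount, updated_at FROM orders WHERE updated_at > ?",
    (last_sync,),
).fetchall()

for row in changed:
    # In a real pipeline these rows would be sent downstream (queue, warehouse, ...).
    print("changed row:", row)

# Advance the watermark to the newest change seen, ready for the next run.
if changed:
    last_sync = max(r[2] for r in changed)
```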

Here are some examples of data ingestion tools on the market.

  • Logstash: Part of the Elastic Stack (formerly ELK stack), Logstash is a server-side data processing pipeline that ingests data from a multitude of sources, transforms it, and then sends it to a "stash" like Elasticsearch. It's highly effective with log or event data.
  • Striim: An end-to-end data integration and streaming analytics platform that supports real-time, batch, and CDC data ingestion; its AI capabilities can also help with data processing and decision-making.
  • Confluent Platform: A full-scale event streaming platform built around Apache Kafka, providing enterprise-level capabilities for real-time and batch data ingestion, with added security, reliability, and management features (a minimal producer sketch follows this list).
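
To illustrate streaming ingestion with a Kafka-based platform such as Confluent, here is a minimal sketch using the confluent-kafka Python client; the broker address, topic name, and event payload are illustrative assumptions, and a downstream consumer or connector would read from the topic.

```python
# Minimal streaming-ingestion sketch using the confluent-kafka Python client.
# Broker address, topic, and payload are illustrative assumptions.
import json

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def on_delivery(err, msg):
    # Called once per message after the broker acknowledges (or rejects) it.
    if err is not None:
        print(f"delivery failed: {err}")
    else:
        print(f"delivered to {msg.topic()} [{msg.partition()}]")

event = {"order_id": 123, "amount": 25.5, "status": "created"}
producer.produce(
    topic="orders",
    key=str(event["order_id"]),
    value=json.dumps(event),
    on_delivery=on_delivery,
)
producer.flush()  # block until outstanding messages are delivered
```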

CDC-based tools such as Oracle GoldenGate and Qlik Replicate provide the ability to move data across hybrid environments (on-premises and cloud). Additionally, data warehouse vendors provide their own sets of tools that support batch ingestion and/or streaming scenarios.

Data Transformation

Data comes in all shapes and sizes, including obscure formats. It can be from a diverse range of sources and be either structured or unstructured. Certain data may be of high quality, while other data may be redundant, broken, or misleading.

There is a commonly accepted structure to the data transformation process. It typically begins with data discovery and mapping, followed by the design of the code that will perform the transformations. Once the code is ready, it is executed against the input data. After the input data is transformed into the desired output format, organizations need to implement meticulous review and testing to ensure its quality.
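
A minimal pandas sketch of that flow is shown below; the column names and cleaning rules are illustrative, with the final assertions standing in for the review-and-testing step.

```python
# Minimal transformation sketch with pandas; columns and rules are illustrative.
import pandas as pd

# Input data discovered and mapped from a source system.
raw = pd.DataFrame({
    "Customer ID": [1, 2, 2, 3],
    "signup_date": ["2024-01-05", "2024-02-11", "2024-02-11", "not a date"],
})

# Execute the designed transformations.
clean = (
    raw.rename(columns={"Customer ID": "customer_id"})   # standardize column names
       .assign(signup_date=lambda df: pd.to_datetime(df["signup_date"], errors="coerce"))
       .drop_duplicates(subset="customer_id")            # remove duplicate customers
       .dropna(subset=["signup_date"])                   # drop unparseable dates
)

# Review and test the output before publishing it downstream.
assert clean["customer_id"].is_unique
assert clean["signup_date"].notna().all()
print(clean)
```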

The benefits of data transformation are immense, making it a key step toward embracing a completely data-driven culture. This is because it pushes data quality to the forefront and helps organizations better understand, manage, and leverage their data for maximum gain.

Data transformation is a focal point of data engineering, where disparate data is transformed into powerful business assets that are ready to be deployed to end users.

Data Build Tool (dbt) has become a popular tool in modern analytics and data warehousing. Once raw data is loaded into the data warehouse, dbt provides the “T” in ELT, allowing data engineers to develop SQL-based transformations. It also integrates with Git, enabling version-controlled data modeling. Data quality checks (tests) in dbt can be leveraged to catch data integrity issues in the pipeline, and dbt also generates documentation and lineage information for consumption.
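
As a rough sketch of wiring dbt into a pipeline, the snippet below uses dbt's programmatic runner, assuming dbt-core 1.5+ and an already configured dbt project; the model selector is an illustrative placeholder.

```python
# Sketch: invoking dbt from Python, assuming dbt-core 1.5+ and a configured project.
# The "staging+" selector is an illustrative placeholder.
from dbt.cli.main import dbtRunner, dbtRunnerResult

dbt = dbtRunner()

# Run the SQL models (the "T" in ELT) inside the warehouse.
run_result: dbtRunnerResult = dbt.invoke(["run", "--select", "staging+"])
if not run_result.success:
    raise RuntimeError("dbt run failed")

# Run dbt's data quality tests against the freshly built models.
test_result: dbtRunnerResult = dbt.invoke(["test", "--select", "staging+"])
if not test_result.success:
    raise RuntimeError("dbt tests failed")
```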

Data Orchestration

Data orchestration is one of the most important aspects of data engineering. Orchestration refers to the automated management, coordination, and monitoring of key workflows and activities in a data engineering pipeline.

A data silo is a large body of data that’s accessible to only one or a few teams within an organization, which can be detrimental to other departments as well as the organization as a whole. Data orchestration involves automated processes that break down silos and make siloed data available in a manner that’s useful, accessible, and easy to integrate.

More complex data engineering pipelines require more robust data orchestration solutions. Data engineers need to ensure that orchestration includes the flow of supporting metadata as well, since data without metadata is not a valuable business asset. Examples of metadata that need to be orchestrated alongside the data include pipeline, technical, operational, schema, data-lineage, and reference metadata.

A data orchestration tech stack comprises mechanisms that move data from silos to its application destinations. It also includes a metadata layer that provides context and shared semantics across disparate data sets. These capabilities are bolstered by a suite of data transformation and cleansing tools, ensuring that the data is in the correct format and of high quality before it reaches its final destination.

Data orchestration is never about automating data processes simply for the sake of automation. It’s always about optimizing pipelines and ensuring that data is reaching end users in a timely, safe, and efficient manner. 

For example, open-source data integration tools such as Apache NiFi enable the automation of data flow between systems. Similarly, the open-source Apache Airflow lets engineers programmatically author, schedule, and monitor workflows and is often used to orchestrate machine learning pipelines. Astronomer.io is a managed Apache Airflow solution that takes some of the complexity out of running Airflow. AWS offers Managed Workflows for Apache Airflow and Google offers Cloud Composer, both managed Airflow offerings in the cloud. Prefect and Dagster are additional orchestration engines gaining popularity in the modern data stack.
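
For instance, a minimal Airflow DAG might look like the sketch below (Airflow 2.4+ syntax; the task bodies and names are placeholders rather than a real pipeline).

```python
# Minimal Airflow 2.4+ DAG sketch; task bodies and names are illustrative placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("pull data from sources")


def transform():
    print("run transformations")


def load():
    print("publish to the warehouse")


with DAG(
    dag_id="example_elt_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",   # run once per day
    catchup=False,       # do not backfill past runs
) as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t3 = PythonOperator(task_id="load", python_callable=load)

    t1 >> t2 >> t3       # run the tasks in sequence
```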

Data orchestration is a critical element of DataOps, DevOps, and any other agile and highly scalable approaches. Modern businesses prioritize operational efficiency and speed. Robust data orchestration is a powerful component of the data engineering tech stack that can help companies make assertive data-driven business decisions.

Data Quality

Data quality is a more complex measurement than it appears. There are key attributes that “good-quality data” should ideally have, such as consistency, accuracy, and timeliness. However, from a data engineering perspective, those attributes, along with others like completeness, uniformity, uniqueness, coherence, and credibility, shouldn’t neglect the business context. They need to be highly contextualized to a specific company’s data goals to be of value.

A data engineer’s definition of data quality should be the measurement between an enterprise’s expectations of their data and the current state of their data. Every organization needs to set its own high standards and parameters when it comes to data governance and quality management. It’s a data engineer’s responsibility to frame those benchmarks and ensure they are upheld throughout the entire data engineering cycle.

Most drops in data quality occur due to software engineering mishaps, particularly when standardized practices are neglected or not established. Data engineers should aim to address the root cause of data quality issues rather than catch them at later stages of a data pipeline. Collaboration with software engineers is highly recommended, along with modifying inconsistent practices in the early stages of a data journey. Deep standardization at every step of data engineering will likely solve the majority of data quality challenges.

A data engineer should work with a company’s data and its data stewards regularly and meticulously, which requires a marriage of shrewd data leadership and advanced data quality testing and tools. Binary (pass/fail) questioning via data quality tools has historically been a successful way of identifying drops in data quality. Simple questions can go a long way, as long as they are standardized and posed regularly at the right junctures.

Most of the tools described earlier provide the ability to validate data as it moves through the pipeline, and data engineers can enable data checks as required using this approach. Dedicated data quality solutions such as Great Expectations, Talend Data Quality, and IBM InfoSphere QualityStage can be used to profile data and identify anomalies as they happen. When selecting data quality tools, consider factors like the scale of the data, the complexity of the infrastructure, and the types of quality challenges involved (missing, duplicate, or inconsistent data). Clean and trustworthy data is critical to the success of data analytics and machine learning applications.
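
As an example, the sketch below declares a few expectations with the classic pandas-flavored Great Expectations API (newer "GX" releases use a different, context-based API); the columns and rules are illustrative.

```python
# Sketch using the classic pandas-flavored Great Expectations API
# (newer "GX" releases use a different, context-based API). Columns are illustrative.
import great_expectations as ge
import pandas as pd

df = ge.from_pandas(pd.DataFrame({
    "order_id": [1, 2, 2, None],
    "amount": [10.0, 25.5, 25.5, -3.0],
}))

# Declare expectations: the kinds of checks a data engineer would standardize.
df.expect_column_values_to_not_be_null("order_id")
df.expect_column_values_to_be_unique("order_id")
df.expect_column_values_to_be_between("amount", min_value=0)

# Validate and inspect the overall result (False here: nulls, duplicates, negatives).
results = df.validate()
print(results.success)
```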

Data Observability

Data Observability is a robust and critical layer in the modern enterprise’s data tech stack. It refers to the extent of visibility of a company’s data and data-related lifecycles, processes, and pipelines. 

Data observability needs to envelop every step of data engineering; an observability blind spot in one section of the data engineering lifecycle can have long-lasting and disastrous consequences. End-to-end data observability is especially critical for businesses because it allows for detecting unknown problems. This differs significantly from handling data quality, wherein companies search for and identify known problems with binary logic queries.

From a data engineering perspective, it’s important to remember that observability is not the same as testing and monitoring. Data observability is far more proactive and focuses on predicting data-related disruptions and downtime. This is done via a multitude of automated mechanisms that can be integrated into various sections of the data engineering process. Data observability at its full potential can free up data engineers from having to reactively address challenges without knowing the root cause.

Furthermore, observability is more than just viewing a company’s data landscape. It uses automated tools and technologies to document data lineage, which enables data engineers to easily pinpoint the reason for a data compromise or a period of downtime and act swiftly and strategically to remediate it.

Lastly, data observability helps surveil the health of data at rest as well as in transit. It can significantly augment DevOps initiatives by unclogging data pipelines, maintaining data quality, increasing operational efficiency, and reducing downtime.

Examples of data observability tools range from general observability platforms such as Datadog, Dynatrace, New Relic, Grafana, and AppDynamics to more data-focused options such as Monte Carlo, Datafold, and Acceldata. Data observability tools provide features such as data lineage, alerting, anomaly detection, quality metrics, and integration with other parts of the data engineering process, such as ETL/ELT tools.
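
As a simple illustration of the signals such platforms automate, the sketch below hand-rolls two common checks, freshness and volume; the thresholds, sample values, and alert hook are illustrative assumptions.

```python
# Hand-rolled sketch of two common observability signals: freshness and volume.
# Thresholds, sample values, and the alert hook are illustrative assumptions.
from datetime import datetime, timedelta, timezone

def check_freshness(last_loaded_at: datetime, max_lag: timedelta) -> bool:
    """Return False if the table's newest data is older than max_lag."""
    return datetime.now(timezone.utc) - last_loaded_at <= max_lag

def check_volume(todays_rows: int, recent_daily_rows: list[int], tolerance: float = 0.5) -> bool:
    """Return False if today's row count deviates too far from the recent average."""
    baseline = sum(recent_daily_rows) / len(recent_daily_rows)
    return abs(todays_rows - baseline) <= tolerance * baseline

def alert(message: str) -> None:
    # In practice this would page on-call or post to a chat/incident channel.
    print("ALERT:", message)

last_loaded_at = datetime.now(timezone.utc) - timedelta(hours=6)
if not check_freshness(last_loaded_at, max_lag=timedelta(hours=2)):
    alert("orders table is stale")

if not check_volume(todays_rows=120, recent_daily_rows=[1000, 950, 1020]):
    alert("orders row count looks anomalous")
```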

While we have covered most topics, shouldn’t data governance be a critical topic covered here as well? Absolutely; however, the purpose of the prior sections is to address the various tools that are part of the data pipeline. Data governance, in our opinion, is a much broader topic that requires addressing metadata, access, collaboration, data classification and cataloging, compliance/security, and more. As with other modern data stack solutions, numerous products such as DataHub, Collibra, Informatica, Alation, Atlan, and Talend play in this space.

Next: Cloud Data Warehouse - Features and Benefits
