Kimball University: The Subsystems of ETL Revisited > > Intelligent Enterprise: Better Insight for Business Decisions

Welcome Guest. | Log In| Register | Membership Benefits

Intelligent Enterprise

Better Insight for Business Decisions

Intelligent Enterprise - Better Insight for Business Decisions
search Intelligent Enterprise
Home
Digital Library
Events
RSS | Newsletters
Webcasts


  • EMAIL
  • PRINT
  • REPRINTS
  • Follow Us on Twitter
  • FOLLOW US
  • Share

Kimball University: The Subsystems of ETL Revisited


These 34 subsystems cover the crucial extract, transform and load architecture components required in almost every dimensional data warehouse environment. Understanding the breadth of requirements is the first step to putting an effective architecture in place.


By Bob Becker
October 21, 2007

Through education and consulting work, Kimball Group has been exposed to hundreds of successful data warehouses. Careful study of these successes has revealed a set of extract, transformation, and load (ETL) best practices. We first described these best practices in an Intelligent Enterprise column three years ago (see "The 38 Subsystems of ETL" ). Since then we have continued to refine the practices based on client experiences, feedback from students and continued research. As a result, we have carefully restructured these best practices into 34 subsystems that represent the key ETL architecture components required in almost every dimensional data warehouse environment. No wonder the ETL system takes such a large percentage of data warehouse and BI project resources!

The good news is that if you study these 34 subsystems, you'll recognize almost all of them and will be on the way to leveraging your experience as you build your ETL system. While we understand and accept the industry's accepted acronym, the "ETL" process really has four major components: Extracting, Cleaning and Conforming, Delivering and Managing. Each of these components and all 34 subsystems contained therein are explained below.

EXTRACTING: GETTING DATA INTO THE DATA WAREHOUSE

To no surprise, the initial subsystems of the ETL architecture address the issues of understanding your source data, extracting the data and transferring it to the data warehouse environment where the ETL system can operate on it independent of the operational systems. While the remaining subsystems focus on the transforming, loading and system management within the ETL environment, the initial subsystems interface to the source systems to access the required data. The extract-related ETL subsystems include:

Data Profiling (subsystem 1) — Explores a data source to determine its fit for inclusion as a source and the associated cleaning and conforming requirements.

Change Data Capture (subsystem 2) — Isolates the changes that occurred in the source system to reduce the ETL processing burden.

Extract System (subsystem 3) — Extracts and moves source data into the data warehouse environment for further processing.

CLEANING AND CONFORMING DATA

These critical steps are where the ETL system adds value to the data. The other activities, extracting and delivering data, are obviously important, but they simply move and load the data. The cleaning and conforming subsystems change data and enhance its value to the organization. In addition, these subsystems should be architected to create metadata used to diagnose source-system problems. Such diagnoses can eventually lead to business process reengineering initiatives to address the root causes of dirty data and to improve data quality over time.

The ETL data cleaning process is often expected to fix dirty data, yet at the same time the data warehouse is expected to provide an accurate picture of the data as it was captured by the organization's production systems (see related article, "Data Stewardship 101: First Step to Quality and Consistency). It's essential to strike the proper balance between these conflicting goals. The key is to develop an ETL system capable of correcting, rejecting or loading data as is, and then highlighting, with easy-to-use structures, the modifications, standardizations, rules and assumptions of the underlying cleaning apparatus so the system is self-documenting.

The five major subsystems in the cleaning and conforming step include:

Data Cleansing System (subsystem 4) — Implements data quality processes to catch quality violations.

Error Event Tracking (subsystem 5) — Captures all error events that are vital inputs to data quality improvement.

Audit Dimension Creation (subsystem 6) — Attaches metadata to each fact table as a dimension. This metadata is available to BI applications for visibility into data quality.

Deduplication (subsystem 7) — Eliminates redundant members of core dimensions, such as customers or products. May require integration across multiple sources and application of survivorship rules to identify the most appropriate version of a duplicate row.

Data Conformance (subsystem 8) — Enforces common dimension attributes across conformed master dimensions and common metrics across related fact tables (see related article, "Kimball University: Data Integration for Real People").


  • EMAIL
  • PRINT
  • REPRINTS
  • Follow Us on Twitter
  • FOLLOW US
  • Share


 





New on the BLOG
Is Oracle An Apps Company that Sells Hardware, or a Hardware Company that Dabbles in Apps?
09. 8.2010
blog author
Josh Greenbaum
Oracle is now more firmly a hardware company than it was last week, when all it had in the hardware category was a company called Sun. Indeed, with Mark's ascendancy and Charles' "retirement," the real question is whether Oracle's applications product line will ever become a major part of the company's product focus again.

Read more from Josh Greenbaum >>

Seth Grimes
Semantics and Analytics Unlock Value in Social and Online Content
Facebook, LinkedIn, Trip Advisor, and Twitter -- social media -- are almost incidental, replaceable tomorrow if another platform proves more attractive, powerful, and agile. It's content that is king, especially "smart content" that allows producers and consumers alike to find the greatest value in online and enterprise content.

09. 8.2010
Read more from Seth Grimes >>

Using Virtualization for "Smart Multi-Tenancy"
09. 8.2010
blog author
Rajan Chandras
The old debate on single- vs. multi-tenancy for SaaS solutions has become really interesting, the game changer being virtualization... Who's right? Of course, there are two sides to every argument.

Read more from Rajan Chandras >>



Intelligent Enterprise Newsletters
Subscribe Here:
*Email:
 First Name:
 Last Name:
  Intelligent Enterprise Blogosphere Newsletter:
  Intelligent Enterprise Newsletter:

Email Type:





Enabling People and Organizations to Harness the Transformative Power of Technology