Kimball University: The Subsystems of ETL Revisited > > Intelligent Enterprise: Better Insight for Business Decisions

Welcome Guest. | Log In| Register | Membership Benefits

Intelligent Enterprise

Better Insight for Business Decisions

Intelligent Enterprise - Better Insight for Business Decisions
search Intelligent Enterprise
Home
Digital Library
Events
RSS | Newsletters
Webcasts


  • EMAIL
  • PRINT
  • REPRINTS
  • Follow Us on Twitter
  • FOLLOW US
  • Share

Kimball University: The Subsystems of ETL Revisited


These 34 subsystems cover the crucial extract, transform and load architecture components required in almost every dimensional data warehouse environment. Understanding the breadth of requirements is the first step to putting an effective architecture in place.


By Bob Becker
October 21, 2007

Through education and consulting work, Kimball Group has been exposed to hundreds of successful data warehouses. Careful study of these successes has revealed a set of extract, transformation, and load (ETL) best practices. We first described these best practices in an Intelligent Enterprise column three years ago (see "The 38 Subsystems of ETL" ). Since then we have continued to refine the practices based on client experiences, feedback from students and continued research. As a result, we have carefully restructured these best practices into 34 subsystems that represent the key ETL architecture components required in almost every dimensional data warehouse environment. No wonder the ETL system takes such a large percentage of data warehouse and BI project resources!

The good news is that if you study these 34 subsystems, you'll recognize almost all of them and will be on the way to leveraging your experience as you build your ETL system. While we understand and accept the industry's accepted acronym, the "ETL" process really has four major components: Extracting, Cleaning and Conforming, Delivering and Managing. Each of these components and all 34 subsystems contained therein are explained below.

EXTRACTING: GETTING DATA INTO THE DATA WAREHOUSE

To no surprise, the initial subsystems of the ETL architecture address the issues of understanding your source data, extracting the data and transferring it to the data warehouse environment where the ETL system can operate on it independent of the operational systems. While the remaining subsystems focus on the transforming, loading and system management within the ETL environment, the initial subsystems interface to the source systems to access the required data. The extract-related ETL subsystems include:

Data Profiling (subsystem 1) — Explores a data source to determine its fit for inclusion as a source and the associated cleaning and conforming requirements.

Change Data Capture (subsystem 2) — Isolates the changes that occurred in the source system to reduce the ETL processing burden.

Extract System (subsystem 3) — Extracts and moves source data into the data warehouse environment for further processing.

CLEANING AND CONFORMING DATA

These critical steps are where the ETL system adds value to the data. The other activities, extracting and delivering data, are obviously important, but they simply move and load the data. The cleaning and conforming subsystems change data and enhance its value to the organization. In addition, these subsystems should be architected to create metadata used to diagnose source-system problems. Such diagnoses can eventually lead to business process reengineering initiatives to address the root causes of dirty data and to improve data quality over time.

The ETL data cleaning process is often expected to fix dirty data, yet at the same time the data warehouse is expected to provide an accurate picture of the data as it was captured by the organization's production systems (see related article, "Data Stewardship 101: First Step to Quality and Consistency). It's essential to strike the proper balance between these conflicting goals. The key is to develop an ETL system capable of correcting, rejecting or loading data as is, and then highlighting, with easy-to-use structures, the modifications, standardizations, rules and assumptions of the underlying cleaning apparatus so the system is self-documenting.

The five major subsystems in the cleaning and conforming step include:

Data Cleansing System (subsystem 4) — Implements data quality processes to catch quality violations.

Error Event Tracking (subsystem 5) — Captures all error events that are vital inputs to data quality improvement.

Audit Dimension Creation (subsystem 6) — Attaches metadata to each fact table as a dimension. This metadata is available to BI applications for visibility into data quality.

Deduplication (subsystem 7) — Eliminates redundant members of core dimensions, such as customers or products. May require integration across multiple sources and application of survivorship rules to identify the most appropriate version of a duplicate row.

Data Conformance (subsystem 8) — Enforces common dimension attributes across conformed master dimensions and common metrics across related fact tables (see related article, "Kimball University: Data Integration for Real People").


  • EMAIL
  • PRINT
  • REPRINTS
  • Follow Us on Twitter
  • FOLLOW US
  • Share


 





New on the BLOG
5 Opportunities and 3 Threats for Oracle
02. 9.2010
blog author
Rajan Chandras
With the acquisition of Sun, Oracle now has a few things going for it, including something no other IT giant has -- not IBM, not Microsoft, and not SAP. And lurking also are a few challenges.

Read more from Rajan Chandras >>

Cindi Howson
Is Gartner's Quadrant the Problem, Or Is It How It's Used?
Bashing Gartner's Magic Quadrants seems to be a popular industry pastime, but in truth, I kind of like the quadrants. My biggest gripe is in how the quadrants are used, not necessarily the quadrants themselves...

02. 8.2010
Read more from Cindi Howson >>

Clarabridge Asks, Are You Customer Experienced?
02. 5.2010
blog author
Seth Grimes
Add "customer" to Jimi Hendrix' song title and you have a question central to last week's Clarabridge Customer Connections (C3) conference, Are You Customer Experienced?

Read more from Seth Grimes >>



Intelligent Enterprise Newsletters
Subscribe Here:
*Email:
 First Name:
 Last Name:
  Intelligent Enterprise Blogosphere Newsletter:
  Intelligent Enterprise Newsletter:

Email Type: