Welcome Guest. | Log In| Register | Membership Benefits

Intelligent Enterprise

Better Insight for Business Decisions

Intelligent Enterprise - Better Insight for Business Decisions
search Intelligent Enterprise
Home
Digital Library
Events
RSS | Newsletters
Webcasts


April 10, 2000 Volume 3 - Number 6


Use your standard query tools to track data quality and lineage

Indicators of Quality


Ralph Kimball                

As data warehouse managers, we always have an eye on data quality. because our users trust us implicitly, we worry whether our data is accurate and complete. We occasionally have nightmares that the auditors will ask us how the data got into the database, and what exactly our assumptions were when we computed a certain number.

Certainly as data warehouses move closer to the operational interfaces of our Web-enabled businesses, they are more likely to become “systems of record.” For instance, how many of you are computing salesperson commissions from the data warehouse? If you are, then you need to speak to data quality and lineage when the first legal dispute arises over the calculation of a large commission. The data warehouse industry uses the term lineage to describe the traceable origins and ownership of something. Other endeavors, such as art collecting and stamp collecting, use the word provenance to describe the same thing.

Let’s make a list of data quality and lineage descriptors that we would like to have available when we face the auditors, or maybe when we are just running a report and are curious about the underlying assumptions. (See Figure 1)

Figure 1 Data quality and lineage descriptors.
Quantitive data quality measures:
Overall data quality score
Completeness of the data collection relative to maximum possible
Number of underlying data elements
Number of not-applicable data elements encountered in the input
Number of corrupt data elements encountered in the input
Number of out-of-bounds data elements encountered in the input
Number of unknown data elements treated as zero (or mean value)
Number of data elements changed manually in load process
Number of data elements not classified in regular aggregates
Number of corrections posted since original data load

Data quality processing indicators:
Extract step completion date/time
Name and address matching step completion date/time
Surrogate key generation date/time
Value scan step date/time
Aggregation create/update date/time
Data available online date/time
Last correction posted date/time

Environmental descriptors:
ETL system master version number
Allocation logic version
Plan version
Budget version
Sales region version
Currency conversion version


Although very useful, this list seems a bit too idealistic. For example, it raises a number of tough questions: How do we actually use these data quality and lineage indicators? Where do we store them? Is this list data or metadata? If it is metadata, then how do we keep all these indicators tightly coupled to the real data? What is the granularity of these indicators? How can we apply these useful indicators to high-level data that comes from many different sources?

Start With Lowest Possible Grain

Let’s try to answer several of these questions with a single compact design. Things become a little clearer when we remember that the most expressive and flexible data is always the lowest-level data. A fact table record that represents a single transaction or a single snapshot at a point in time has the most dimensions because more of the surrounding descriptors take on a single value. When you aggregate data, you are forced to prune off your dimension list. Aggregated data becomes less focused.

We take the aggressive approach of attaching data quality indicators directly to the data itself. (See Figure 2) All you have to do is add a simple “audit” key to the original data record at the lowest level of granularity. A four-byte key should be sufficient for any variation on our data quality tracking theme.

Figure 2 A typical fact table at the lowest level of granularity augmented with an audit dimension that enables data quality lineage reporting without a special metadata tool.


The audit key is a meaningless integer key that serves only to join to an audit dimension. (In other words, it is a typical data warehouse surrogate key.) The audit dimension contains the immediate data quality and lineage context for the particular fact table record. Besides the key, the audit dimension fields include some descriptors. (See Figure 3)

Figure 3 Audit dimension field descriptors.

Out-of-bounds indicator
Expected but missing indicator
Artificially supplied indicator
Changed after original load indicator
Extract step completion date/time
Name and address matching step
     completion date/time
Surrogate key generation date/time
Value scan step date/time
Aggregation create/update date/time
Data available online date/time
Last correction posted date/time
ETL master version number
Allocation logic version
Plan version
Budget version
Sales region version
Currency conversion version


Every field in this list has a single well-defined meaning for each fact record. The first four indicators are text values. For instance, the artificially supplied indicator could have the values of “not applicable,” “zero,” “estimated mean value,” or “estimated zero variance value.” Data miners will find these choices familiar because filling in estimated values to cause the least disruption to the overall data set is an important technique for addressing missing or corrupt data.

The seven date/time stamps usually have legitimate date/time values, but must be capable of representing null values in order to handle situations where a step is not performed.

The extract-transform-load (ETL) master version number is a key to the current description of the ETL software suite. The ETL software librarian, whose job it is to maintain a complete list of all ETL components, including storage location, individual version numbers, and backup status, should be maintaining the master version number. The full description of the ETL software suite is a large record in a separate table (not shown here) that could easily consist of hundreds of items. In my DBMS article “Meta Meta Data Data” (March 1998), I listed approximately 80 of the metadata components in such a suite. Whenever we change any aspect of the ETL software suite, the low-order digits of the ETL system master version number must change as well. If we administer this ETL metadata correctly, not only will the audit trail be very specific, but we will be able to restore a consistent, complete set of ETL components. This view of an ETL system mimics the responsibilities of a software librarian in a product software development shop.

Perhaps you are wondering whether you could supply a companion front-end tools master version number into this audit record, but in my opinion, this isn’t the place for such a key. While such a tools-oriented master version number is probably created with same kind of process as the ETL master version number, the data itself can’t guarantee that a user is viewing the data through a specific version of the front-end tools, and therefore such a master version number would be meaningless in understanding the lineage of a particular report.

The final five version fields in our audit dimension are all text entries that describe the overall business assumptions you use to allocate, map, and combine the data.

Our resulting audit dimension record should be of low cardinality compared to the fact table. All the fact table records loaded in the same batch run of the ETL system will probably have the same audit key, except for the few exceptional records that you have to modify or artificially supply. These exceptional records will generate only a few more audit keys.

In this article, we are building a framework for describing the quality and lineage of measurement-oriented fact table records. If we wish to describe the quality and lineage of dimension table records such as customer descriptions, then we can use many of the same techniques, but we would add the new audit fields into the existing dimension records.

Reporting Aggregate Data Quality

The audit dimension lets a conventional query tool report on the data quality, ETL environment, and business logic assumptions of any set of data. Just use the audit dimension as you would any other dimension. You can group by or constrain on any of the audit attributes, and you can use your normal query and reporting tools to display the results. You don’t need a custom metadata reporting tool to do any of this.

We can explore many of the interesting data quality and lineage issues just by using the audit dimension as I described previously. But there are some questions that are defined only for aggregate data sets. The overall data quality score and the completeness of the data collection relative to the maximum possible measures we included in our original requirements list are good examples.

When you are reporting on the completeness of a set of data at an aggregate level, you can’t mark a data element as missing if that means the record isn’t in the database. This dilemma is an example of representing “what didn’t happen” in a database. In all analyses of what didn’t happen, somewhere in the database you have to describe two things: 1) what did happen, and 2) what is the universe of all possibilities. You then subtract #1 from #2 to find out what didn’t happen. For a more complete perspective on this issue, see my Intelligent Enterprise article “What Didn’t Happen” (February 16, 1999).

In our example, you need to decide how to encode #2, the universe of all possibilities. Sometimes you can do this by adding only a few records to the database, especially when you know the records exist but somehow your data delivery pipeline failed to deliver them in a timely way. For example, if you are a retail operation with 600 stores, and in this morning’s load you received only 598 sets of data, then in your daily store totals aggregate fact table, you should include artificially generated records for the two missing stores. You will need an audit key and a shrunken audit dimension for the aggregate fact table. In this shrunken audit dimension, or possibly in the fact table, you can include completeness indicators. Now you can easily generate reports at various aggregate levels that include a measure of data completeness.

Building the Audit Dimension

I hope I’ve made a compelling case for an audit dimension’s usefulness. But like so many interesting designs in data warehousing, the hard part is actually supplying the data for the audit record. We need to actually build most of the quality and lineage indicators into the ETL pipeline. You need to augment each “data flow” in the ETL pipeline with exception-handling routines that diagnose each record as it is being built. In addition, the environment variables, including the date/timestamps and the version descriptions, need to be waiting to be put into the current audit record. This goal implies, of course, that you must carefully log all the required date/timestamps and version descriptions. (How does the saying go? “If we had some ham, then we could have some ham and eggs; if we had some eggs….”)



Ralph Kimball, Ph.D., co-invented the Star Workstation at Xerox and founder of Red Brick Systems, works as an independent consultant designing large data warehouses. He is the author of The Data Warehouse Toolkit (Wiley, 1996) and the newly published The Data Warehouse Lifecycle Toolkit (Wiley, 1998). You can reach him through his Web page at www.ralphkimball.com.

RESOURCES

“Meta Meta Data Data”(DBMS, March 1998):
www.dbmsmag.com/9803d05.html

“What Didn’t Happen”(Intelligent Enterprise, February 16, 1999):
www.intelligententerprise.com/db_area/archives/1999/991602/warehouse.jhtml





IE Weekly Newsletter
Subscribe to the newsletter
    Email Address