Use your standard query tools to track data quality and lineage |
|
Indicators of Quality |
||||||||
|
||||||||||
As data warehouse managers, we always have an eye on data quality. because our users trust us implicitly, we worry whether our data is accurate and complete. We occasionally have nightmares that the auditors will ask us how the data got into the database, and what exactly our assumptions were when we computed a certain number.
Certainly as data warehouses move closer to the operational interfaces of our Web-enabled businesses, they are more likely to become systems of record. For instance, how many of you are computing salesperson commissions from the data warehouse? If you are, then you need to speak to data quality and lineage when the first legal dispute arises over the calculation of a large commission. The data warehouse industry uses the term lineage to describe the traceable origins and ownership of something. Other endeavors, such as art collecting and stamp collecting, use the word provenance to describe the same thing.
Lets make a list of data quality and lineage descriptors that we would like to have available when
we face the auditors, or maybe when we are just running a report and are curious about the underlying
assumptions. (See Figure 1)
Figure 1 Data quality and lineage descriptors.
|
Quantitive data quality measures:
Overall data quality score Completeness of the data collection relative to maximum possible Number of underlying data elements Number of not-applicable data elements encountered in the input Number of corrupt data elements encountered in the input Number of out-of-bounds data elements encountered in the input Number of unknown data elements treated as zero (or mean value) Number of data elements changed manually in load process Number of data elements not classified in regular aggregates Number of corrections posted since original data load Data quality processing indicators: Extract step completion date/time Name and address matching step completion date/time Surrogate key generation date/time Value scan step date/time Aggregation create/update date/time Data available online date/time Last correction posted date/time Environmental descriptors: ETL system master version number Allocation logic version Plan version Budget version Sales region version Currency conversion version |
Although very useful, this list seems a bit too idealistic. For example, it raises a number of tough questions: How do we actually use these data quality and lineage indicators? Where do we store them? Is this list data or metadata? If it is metadata, then how do we keep all these indicators tightly coupled to the real data? What is the granularity of these indicators? How can we apply these useful indicators to high-level data that comes from many different sources?
Start With Lowest Possible Grain
Lets try to answer several of these questions with a single compact design. Things become a little clearer when we remember that the most expressive and flexible data is always the lowest-level data. A fact table record that represents a single transaction or a single snapshot at a point in time has the most dimensions because more of the surrounding descriptors take on a single value. When you aggregate data, you are forced to prune off your dimension list. Aggregated data becomes less focused.
We take the aggressive approach of attaching data quality indicators directly to the data
itself. (See Figure 2) All you have to do is add a simple audit key to the original data
record at the lowest level of granularity. A four-byte key should be sufficient for any variation on our
data quality tracking theme.
Figure 2 A typical fact table at the lowest level of granularity augmented with an audit
dimension that enables data quality lineage reporting without a special metadata tool.
The audit key is a meaningless integer key that serves only to join to an audit dimension. (In other
words, it is a typical data warehouse surrogate key.) The audit dimension contains the immediate data
quality and lineage context for the particular fact table record. Besides the key, the audit dimension
fields include some descriptors. (See Figure 3)
Figure 3 Audit dimension field descriptors.
|
Out-of-bounds indicator Expected but missing indicator Artificially supplied indicator Changed after original load indicator Extract step completion date/time Name and address matching step completion date/time Surrogate key generation date/time Value scan step date/time Aggregation create/update date/time Data available online date/time Last correction posted date/time ETL master version number Allocation logic version Plan version Budget version Sales region version Currency conversion version |
Every field in this list has a single well-defined meaning for each fact record. The first four indicators are text values. For instance, the artificially supplied indicator could have the values of not applicable, zero, estimated mean value, or estimated zero variance value. Data miners will find these choices familiar because filling in estimated values to cause the least disruption to the overall data set is an important technique for addressing missing or corrupt data.
The seven date/time stamps usually have legitimate date/time values, but must be capable of representing null values in order to handle situations where a step is not performed.
The extract-transform-load (ETL) master version number is a key to the current description of the ETL software suite. The ETL software librarian, whose job it is to maintain a complete list of all ETL components, including storage location, individual version numbers, and backup status, should be maintaining the master version number. The full description of the ETL software suite is a large record in a separate table (not shown here) that could easily consist of hundreds of items. In my DBMS article Meta Meta Data Data (March 1998), I listed approximately 80 of the metadata components in such a suite. Whenever we change any aspect of the ETL software suite, the low-order digits of the ETL system master version number must change as well. If we administer this ETL metadata correctly, not only will the audit trail be very specific, but we will be able to restore a consistent, complete set of ETL components. This view of an ETL system mimics the responsibilities of a software librarian in a product software development shop.
Perhaps you are wondering whether you could supply a companion front-end tools master version number into this audit record, but in my opinion, this isnt the place for such a key. While such a tools-oriented master version number is probably created with same kind of process as the ETL master version number, the data itself cant guarantee that a user is viewing the data through a specific version of the front-end tools, and therefore such a master version number would be meaningless in understanding the lineage of a particular report.
The final five version fields in our audit dimension are all text entries that describe the overall business assumptions you use to allocate, map, and combine the data.
Our resulting audit dimension record should be of low cardinality compared to the fact table. All the fact table records loaded in the same batch run of the ETL system will probably have the same audit key, except for the few exceptional records that you have to modify or artificially supply. These exceptional records will generate only a few more audit keys.
In this article, we are building a framework for describing the quality and lineage of measurement-oriented fact table records. If we wish to describe the quality and lineage of dimension table records such as customer descriptions, then we can use many of the same techniques, but we would add the new audit fields into the existing dimension records.
Reporting Aggregate Data Quality
The audit dimension lets a conventional query tool report on the data quality, ETL environment, and business logic assumptions of any set of data. Just use the audit dimension as you would any other dimension. You can group by or constrain on any of the audit attributes, and you can use your normal query and reporting tools to display the results. You dont need a custom metadata reporting tool to do any of this.
We can explore many of the interesting data quality and lineage issues just by using the audit dimension as I described previously. But there are some questions that are defined only for aggregate data sets. The overall data quality score and the completeness of the data collection relative to the maximum possible measures we included in our original requirements list are good examples.
When you are reporting on the completeness of a set of data at an aggregate level, you cant mark a data element as missing if that means the record isnt in the database. This dilemma is an example of representing what didnt happen in a database. In all analyses of what didnt happen, somewhere in the database you have to describe two things: 1) what did happen, and 2) what is the universe of all possibilities. You then subtract #1 from #2 to find out what didnt happen. For a more complete perspective on this issue, see my Intelligent Enterprise article What Didnt Happen (February 16, 1999).
In our example, you need to decide how to encode #2, the universe of all possibilities. Sometimes you can do this by adding only a few records to the database, especially when you know the records exist but somehow your data delivery pipeline failed to deliver them in a timely way. For example, if you are a retail operation with 600 stores, and in this mornings load you received only 598 sets of data, then in your daily store totals aggregate fact table, you should include artificially generated records for the two missing stores. You will need an audit key and a shrunken audit dimension for the aggregate fact table. In this shrunken audit dimension, or possibly in the fact table, you can include completeness indicators. Now you can easily generate reports at various aggregate levels that include a measure of data completeness.
Building the Audit Dimension
I hope Ive made a compelling case for an audit dimensions usefulness. But like so many interesting designs in data warehousing, the hard part is actually supplying the data for the audit record. We need to actually build most of the quality and lineage indicators into the ETL pipeline. You need to augment each data flow in the ETL pipeline with exception-handling routines that diagnose each record as it is being built. In addition, the environment variables, including the date/timestamps and the version descriptions, need to be waiting to be put into the current audit record. This goal implies, of course, that you must carefully log all the required date/timestamps and version descriptions. (How does the saying go? If we had some ham, then we could have some ham and eggs; if we had some eggs .)
Ralph Kimball, Ph.D., co-invented the Star Workstation at Xerox and founder of Red Brick
Systems, works as an independent consultant designing large data warehouses. He is the author of The Data Warehouse Toolkit (Wiley, 1996) and the newly published The Data Warehouse Lifecycle Toolkit (Wiley, 1998). You can reach him
through his Web page at www.ralphkimball.com.
RESOURCESMeta Meta Data Data(DBMS, March 1998):www.dbmsmag.com/9803d05.html What Didnt Happen(Intelligent Enterprise, February 16, 1999): www.intelligententerprise.com/db_area/archives/1999/991602/warehouse.jhtml |
|
|
|
|
|




