http://www.intelligententerprise.com/010507/webhouse1_1.jhtml
Joint Effort
Administer the distributed data warehouse correctly or lose it!
by Ralph Kimball
As you may have noticed, my column's name has changed to "Data Warehouse Designer." This new name is more descriptive of the broader scope and more practical flavor of the subjects I am now covering here.
I will continue to invite a number of experienced data warehouse practitioners to alternate with me in writing these columns. I hope you will enjoy some of their fresh perspectives.
While we are on the subject of new things, this column's topic is about the most important new development in data warehousing: the distributed data warehouse (DDW). Inspired by the Web and the ability to share data through XML, a DDW arises whenever several organizations agree to share data.
A DDW encourages "drilling across" many separated databases to produce coordinated overall results. The most visible and urgent DDWs are supply chains in which confederations of manufacturers, distributors, and retailers share data in order to implement just-in-time "efficient consumer response." But there are many other examples of DDWs in other industries.
The advantages of a distributed architecture are enormous. In situations like supply chains, we simply can't build a centralized data warehouse. The separate organizations have separate technologies, IT staffs, data elements, and administrative rhythms. What ties these diverse entities together is a strong desire to share data, but not share much else!
With the benefit of a few years' experience building DDW architectures, we are beginning to understand the implications of administering and supporting these systems. We need to do our homework in understanding DDW administration, because if we do it incorrectly we will lose just about all the advantages we hope for.
Before outlining what it takes to administer a DDW, it's important to define carefully what it is. A DDW is a collection of organizations that:
- May be physically separated
- May use database technologies from diverse vendors, including both ROLAP and OLAP approaches
- Agree to use centrally administered, conformed dimensions
- Be willing to use centrally defined, conformed facts
- Make an agreement to publish their respective fact tables to the rest of the DDW, subject to role-based security control
- Agree to administer their individual databases according to the principles described in this column.
It should be evident that conformed dimensions are copied and distributed widely around the DDW, and that the conformed fact tables exist as single copies within their respective DDW member organizations.
Centralized Agreement
A DDW doesn't come about because a group of organizations just wants to share data. Significant planning, compromises, and technical development must take place before a DDW can go live. All the organizations participating in the DDW must embrace a centralized agreement. The members of the DDW agree to the following:
To define and use conformed dimensions. For instance, all members will use the Item dimension in a supply chain. All must be willing to live with the attribute names and attribute values in this centrally defined dimension. All the members of the DDW agree to rely on a single item czar who will administer and publish the Item dimension. This person takes the title "dimension authority."
Each conformed dimension in the overall DDW needs a dimension authority. Although not mandatory, it probably would work best to have all the dimension authorities work together in a single office.
To define and use conformed facts. For instance, all members must agree on the definition of the "revenue" measure in a DDW so that separate revenue measurements coming from separate fact tables can be combined mathematically. Unlike with conformed dimensions, the individual member organizations create these facts locally when they publish their own fact tables.
To use the surrogate keys contained in the conformed dimensions. The dimension authority creates a well-administered set of simple keys for each record in the dimension. These keys will usually be integers devoid of any structure, meaning, or order. The Calendar Date dimension is the only exception to the lack of ordering, because we usually use the Calendar Date surrogate key as the basis for physically partitioning the downstream fact tables. The surrogate keys are the primary keys of the dimensions, and all fact tables using conformed dimensions must use the surrogate key values as foreign keys.
To synchronize the addition of new fact records with the correct current release of a dimension. Each member organization makes sure that new fact records they add to their respective fact tables are processed using the correct surrogate keys defined in the most recent release of each dimension. If the member organization has implemented a surrogate key pipeline with lookup tables for finding the contemporary surrogate keys, these lookup tables must be updated whenever a new release of a dimension is received. See the article "Pipelining Your Surrogates" from the June 1998 issue of DBMS magazine (available at www.ralphkimball.com or www.dbmsmag.com/9806d05.html).
To distinguish between minor and major releases of dimensions. A minor release of a dimension occurs when new records have been added to the dimension but no existing records have been altered. In the parlance of slowly changing dimensions, a minor release allows fresh new records as well as Type 2 changes.
A major release of a dimension occurs when existing records have been altered with Type 1 or Type 3 changes. Each new dimension must carry a release version number. A properly constructed drill-across application requires that the version numbers for the particular dimension match as the drill-across operation is carried out. Mismatched version numbers are a warning that the definitions of some of the dimension attributes are not consistent across the DDW and that the results may be incorrect.
Not to modify the published content of a dimension. A member organization can add private attributes to a dimension without invalidating the conformed nature of the dimension and can delete records from the dimension if none of its fact tables use those records, but the values in the attributes the dimension authority supplies must remain unmodified.
A corollary of this rule is that dimensions will usually grow monotonically. A dimension authority can't delete records from a dimension, because the authority can't know if downstream fact tables are still using fact records.
Dimension Authority Duties
With this background, it's easier to understand the administrative steps the dimension authority must take. These steps are:
- Add fresh new records to the dimension, generating new surrogate keys.
- Add new records for Type 2 changes to existing dimension entries (true physical changes at a point in time), generating new surrogate keys.
- Modify records in place for Type 1 changes (overwrites) and Type 3 changes (alternate realities), without changing the surrogate keys. Update the version number of the dimension if any of these Type 1 or Type 3 changes are made.
- Replicate the revised dimension simultaneously to all fact table providers (members of the DDW).
Fact Table Provider Duties
The fact table provider has a more complex job. But, in exchange for this complexity, the fact table provider gets a complete suite of verbose, high-quality dimensions that should add considerable value to each fact table.
For instance, with exactly the same fact table, a really elaborate customer dimension offers far more opportunities for querying, reporting, and data mining than a narrow customer dimension does. With any luck, the extract-transform-load programs for processing each dimension will be similar, making the overall backroom software development task more manageable.
The fact table provider's administrative steps include:
- Receive or download dimension updates.
- Process dimension records marked as new and current, to update current key maps in the surrogate key pipeline.
- Process dimension records marked as new but postdated. This action triggers a complex alternative to the normal surrogate key pipeline processing. I described the detailed processing steps for this case in my September 29, 2000 column, "Backward in Time" (available at www.ralphkimball.com or www.intelligententerprise.com/000929/webhouse.jhtml).
- Add all new records to fact tables after replacing their natural keys with correct surrogate keys.
- Modify records in all fact tables for error correction, accumulating snapshots, and postdated dimension changes. You probably do this on a partition-by-partition basis.
- Remove invalidated aggregates. An existing historical aggregate becomes invalid only when a Type 1 or Type 3 change occurs on the attribute that is the target of the aggregation or if historical fact records have been modified in step 5. Changes to other attributes do not invalidate an aggregate. For instance, a change in the Flavor attribute of a product doesn't invalidate aggregates based on the Category attribute.
- Recalculate affected aggregates. If the new release of a dimension doesn't change the version number, then you need to extend aggregates only to handle newly loaded fact data. If the version number of the dimension has changed, then you may have to recalculate the entire historical aggregate if it was removed in step 6. OLAP systems may handle these steps automatically.
- Assure the quality of all base fact tables and aggregate tables. Be satisfied that the aggregate tables are correctly calculated.
- Bring updated fact and dimension tables online. The detailed strategy for taking a fact table (or more likely a partition of a fact table) offline for the briefest possible duration can be found in my book, The Data Webhouse Toolkit (Wiley, 2000), starting on page 645.
- Inform end users that the database has been updated. Tell the users if major changes have been made, including dimension version changes, postdated record additions, and changes to historical aggregates.
This column sketches just the key steps of administering a DDW. Certainly these steps are complex, but without them the DDW will be a failure.
When conformed dimensions and conformed facts are administered correctly and within the DDW framework, the separate organizational members of the DDW are quite free to develop their databases independently. Furthermore, the benefits of the work you put into producing the verbose, high-quality dimensions are multiplied across the whole DDW. In all probability, no single member of the DDW would have produced such good dimensions, but now everyone can drill across with this high-quality detail.
You may have noticed that I didn't talk about administering the role definitions for security across the DDW. Give me a few weeks and we'll tackle that one, too.
Ralph Kimball co-invented the Star Workstation at Xerox and founded Red Brick Systems. He has three best-selling data warehousing books in print, including The Data Webhouse Toolkit (Wiley, 2000). He teaches dimensional data warehouse design through Kimball University and posts data warehouse design tips, which are available through his Web site, www.rkimball.com.