Welcome Guest. | Log In| Register | Membership Benefits

Intelligent Enterprise

Better Insight for Business Decisions

Intelligent Enterprise - Better Insight for Business Decisions
search Intelligent Enterprise
Home
Digital Library
Events
RSS | Newsletters
Webcasts




September 17, 2002

Two Powerful Ideas

The foundations for modern data warehousing

by Ralph Kimball

There are two powerful ideas at the foundation of most successful data warehouses. First, separate your systems. Second, build stars and cubes.

In my previous column, I described a complete spectrum of design constraints and unavoidable realities facing the data warehouse designer. This was such a daunting list that I worried that you would head for the door. But maybe what kept you reading this far was my promise to dig us out of the morass. This is where the two powerful ideas come in.

Last time, I stated that the nonnegotiable constraints on a data warehouse design were end-user understandability and query execution speed. A complex, slow data warehouse is a failure no matter how elegant the rest of the design may be because the people for whom it was intended won't want to use it.

All the rest of the constraints and unavoidable realities were pragmatic and honest admissions that the data warehouse design space is extremely complex. The source data, the source systems, the database technologies, and the business environments we must deal with are incredibly complicated. So, as good engineers, to dig our way out of the mire, we must decompose the problem into separate manageable parts and emphasize techniques that are predictable, reusable, and robust when our design environment is altered.

Separate Your Systems

The first, crucial step in designing a data warehouse across a complex enterprise is to separate your systems logically, physically, and administratively.

I've found it very useful to think of the project as four distinct and different systems, of which a data warehouse manager should be responsible for only two. I won't overwhelm you with the typical block diagram because the issues are simpler than that. The four systems are:

  • Production (source) transaction processing systems
  • Data warehouse staging area systems
  • Data warehouse presentation systems, including client/server and Web-based query tools and report writers
  • Optional high-end analytic tools supporting data mining, forecasting, scoring, or allocating.

As a data warehouse manager, you shouldn't be responsible for the source systems that are capturing and processing transactions. That's someone else's job. You don't want to be involved in supporting the legal and financial auditing functions or the rollback and recovery functions of these systems. They are the cash registers of the company, and their priorities are different from those of the data warehouse.

The first system for which the data warehouse is responsible is the data staging area, where production data from many sources is brought in, cleaned, conformed, combined, and ultimately delivered to the data warehouse presentation systems. Much has been written about the crucial extract-transform-load (ETL) steps in the staging area, but stepping away from this detail, the main requirement for the staging area is that it is off limits to all final data warehouse clients. The staging area is exactly like the kitchen in a restaurant. The kitchen is a busy, even dangerous, place filled with sharp knives and hot liquids. The cooks are busy, focused on the task of preparing the food. It just isn't appropriate to allow diners into a professional kitchen or allow the cooks to be distracted with the very separate issues of the fine dining experience. In data warehouse terms, by barring all data warehouse clients from the data staging area, we avoid:

  • Guaranteeing up-time service levels for querying or reporting
  • Enforcing client-level security
  • Building performance-enhancing indexes and aggregations for query performance
  • Handling logical and physical conflicts between the querying and data cleaning steps
  • Guaranteeing consistency across separate, asynchronous data sources.

The two dominant data structures in the data staging area are the flat file and the entity/relationship schema, which are directly extracted or derived from the production systems. Almost all processing in the staging area is either sorting or simple sequential processing.

The second major system under the specific control of the data warehouse is the presentation system. Of course, this system is analogous to the dining area of a fine restaurant. The dining area is organized for the comfort of the diners. The food is delivered promptly and in the most appealing way, and hassles and distractions are avoided as much as possible. In the same way, the data warehouse presentation system is purpose-built to enhance the query and reporting experience. The presentation system needs to be simple and fast and present the right data to meet the analysis needs of the end users. Also, in the presentation system, we can easily handle the foregoing bullet list of requirements that we excluded from the staging area.

The dominant data structures in the presentation area are the relational star schema and the online analytic processing (OLAP) data cube. Processing in the presentation area must respond to a blizzard of large and small queries coming in from every possible angle on the data. Over time, there will be no predictable pattern to these queries. Some designers call this the ad hoc attack.

The fourth system on our list is an optional layer of specific high-end analytic tools that often consume data warehouse data in batch quantities. Frequently these data mining, forecasting, scoring, and allocating tools use specialized algorithms outside the normal expertise of the data warehouse designer.

And, honestly, many of these processes have an interpretive or political component that's wisely segregated from the data warehouse. For example, data mining as a discipline is a complex interpretive task involving a whole collection of powerful analytic techniques, many of which the end-user community doesn't readily understand or trust. Proper data mining requires a professional data-mining expert who's equipped to use the tools effectively and represent the data mining results to the community.

In addition, as I've often stated, there's a fundamental impedance mismatch between data mining and the data warehouse. Data mining frequently needs to look at thousands or millions of "observations" over and over again, at extremely high data rates. This just isn't easy to support from the data warehouse directly. Better to hand the observation set over to the data-mining team, just once.






IE Weekly Newsletter
Subscribe to the newsletter
    Email Address