A Trio of Interesting SnowflakesBeat three common modeling challenges with extensions of the dimensional modelBy Ralph KimballContinued from Page 1 Note the importance of including the recency and frequency information as dimensional attributes rather than as facts and overwriting them as time progresses. This decision makes the shopper dimension very powerful. You can do classic shopper segmentation directly off the dimension without navigating a fact table in a complex application. See the discussion of this kind of segmentation in my book, The Data Webhouse Toolkit, starting on page 73. Assuming that many of the final 50 customer attributes are textual, you could have a total record width of 500 bytes or more. Suppose you have 20 million shoppers (16 million visitors and four million registered customers). Obviously, you are worried that in 80 percent of your records, the trailing 50 fields contain no data! In a 10GB dimension, this condition gets your attention. This is a clear case where, depending on the database, you want to introduce a snowflake. You should break the dimension into a base dimension and a snowflake subdimension. All the visitors share a single record in the subdimension, which contains special null attribute values. (See FIGURE 1.) FIGURE 1. A Shopper dimension where 80 percent of the records have 50 null attributes In a fixed-width database, using our previous assumptions, the base shopper dimension is 20 million x 25 bytes=500MB, and the snowflake dimension is 4 million x 475 bytes=1.9GB. You save 8GB by using the snowflake. If you have a query tool that insists on a classic star schema with no snowflakes, then you can hide the snowflake under a view declaration. FINANCIAL PRODUCT DIMENSIONSBanks, brokerage houses, and insurance companies all have trouble modeling their product dimensions because each of the individual products has a host of special attributes not shared by other products. Except for a set of common "core" attributes, a checking account doesn't look very much like a mortgage or certificate of deposit. They even have different numbers of attributes. If you try to build a single product dimension with the union of all possible attributes, you end up with hundreds of attributes, most of which are empty in a given record. The answer in this case is to build a context-dependent snowflake. You isolate the core attributes in a base product dimension table, and include a snowflake key in each base record that points to its proper extended product subdimension. (See Figure 2.) FIGURE 2. A Financial Product dimension with a subdimension for each product type. This solution is not a conventional relational join! The snowflake key must connect to the particular subdimension table that a specific product type defines. Usually you can accomplish this task by constructing a relational view for each product type that hardwires the correct join path. MULTIENTERPRISE CALENDAR DIMENSIONSBuilding a calendar dimension in a distributed data warehouse spanning multiple organizations is difficult because each organization has idiosyncratic fiscal periods, seasons, and holidays. Although you should make a heroic effort to reduce incompatible calendar labels, many times you want to look at the overall multienterprise data through the eyes of just one of the organizations. Unlike the financial products dimensions, each of the separate calendars can have the same number of attributes describing fiscal periods, seasons, and holidays. But there may be hundreds of separate calendars. An international retailer may have to deal with a calendar for each foreign country. In this case you modify the snowflake design to let the snowflake key join to a single calendar subdimension. (See Figure 3.) But the subdimension has higher cardinality than the base dimension! The key for the subdimension is both the snowflake key and the organization key. FIGURE 3. A Calendar dimension with a higher cardinality subdimension. In this situation, you must specify a single organization in the subdimension before evaluating the join between the tables. When done correctly, the subdimension has a one-to-one relationship with the base dimension as if the two tables were a single entity. Now the entire multienterprise data warehouse can be queried through the calendar of any constituent organization. PERMISSIBLE SNOWFLAKESThese three examples show how variations of snowflake designs can be very useful. I hope you feel more confident about answering the question, "When can I use a snowflake?" When you are thinking about design alternatives, you should separate the issues of physical design from those of logical design. Physical design drives performance. Logical design drives understandability. You can certainly use snowflake designs if you maximize both of these goals. RALPH KIMBALL co-invented the Star Workstation at Xerox and founded Red Brick Systems. He has three best-selling data warehousing books in print, including The Data Webhouse Toolkit (Wiley, 2000). He teaches dimensional data warehouse design through Kimball University and critically reviews large data warehouse projects. You can reach him through his Web site, www.rkimball.com.
|
Most Popular This Week
IE Weekly Newsletter
Subscribe to the newsletter
|
|
|




