Doing the Work at Extract TimeCountering the minimalist approachYour mission as data warehouse designers is to publish your data most effectively. Fulfilling this mission means placing the data in the format and framework that is easiest for end users and application developers to use. Perhaps an apt analogy is a well-equipped kitchen. Ideally all the ingredients and tools needed to cook a meal are within easy reach. The ingredients are exactly what the recipe calls for, and the tools are meant for the job. Great cooks are notorious for using elaborate procedures out of view in the kitchen to produce just the right effect when the plate is served; so by analogy the back-room "chefs" in the data warehouse should do much the same. In the back room of your data warehouse, you must counter your natural minimalist tendencies when preparing the data for end users' and application developers' final consumption. In many important cases, you should deliberately trade off increased back-room processing and increased front-room storage requirements in exchange for symmetrical, predictable schemas that users understand, reduced application complexity, and improved performance of queries and reports. Making these trade-offs should be an explicit design goal for the data warehouse architect. But of course you must choose these trade-offs judiciously. A little sugar may make the recipe better, but a lot more sugar produces an inedible mess. Let's look at half a dozen situations where you'll do just enough work at extract time to really make a difference. I'll also draw some boundaries so you'll know when not to overdo it and spoil the final result. Let's start with some rather narrow examples and gradually expand our scope. MODELING EVENTS ACROSS MULTIPLE TIME ZONESVirtually all measurements recorded in data warehouses have one or more timestamps. Sometimes these timestamps are simple calendar dates, but increasingly, we record the exact time to the minute or the second. Also, most enterprises span multiple time zones, whether in the United States, Europe, Asia, or around the world. You therefore have a natural dilemma. Either you standardize all your timestamps to a single well-identified time zone or to the correct local wall clock time when the local measurement event occurred. You might be tempted to say, OK, so what's the big deal? If we want to convert from Greenwich Mean Time to local time, we just figure out which time zone the measurement was taken in, and we apply a simple offset. Unfortunately, this doesn't work. In fact, it fails spectacularly. The rules for international time zones are horrendously complicated. There are more than 500 separate geographic regions with distinct time zone rules. Moral of the story: Don't compute time zones in your application. Rather, add an extra timestamp foreign key in every place you have an existing timestamp, and put both standard and local times in your data. In other words, do the work at extract time, not query time, and give up a little data storage. VERBOSE CALENDAR DIMENSIONSAll fact tables in dimensional data warehouse schemas should eschew native calendar date stamps in favor of integer valued foreign keys that connect to verbose calendar dimensions. This recommendation is not based on a foolish dimensional design consistency, but rather it recognizes that calendars are complicated and applications typically need a lot of navigation help. For example, native SQL date stamps do not identify the last day of the month. Using a proper calendar date dimension, the last day of the month can be identified with a Boolean flag, making applications simple. Just imagine writing a query against a simple SQL date stamp that would constrain on the last day of each month. Here is an example where adding the machinery for an explicit calendar date dimension simplifies queries and speeds up processing by avoiding complex SQL. There are some situations where you actually want to keep the native date stamp without creating a join to a dimension. If the date stamp in question falls outside the range of your enterprise date dimension, it may be nonsensical to extend the date dimension to encompass the date in question. Birth dates far in the past and financial expiration dates far in the future often fall into these questionable categories. KEEPING THE BOOKS ACROSS MULTIPLE CURRENCIESThe multiple time zone perspective discussed in the first example often occurs with the related issue of modeling transactions in multiple currencies. Again, there are two equal and legitimate perspectives. If the transaction took place in a specific currency (say, Swiss franks) you obviously want to keep that information exactly. But if you have a welter of currencies, you'll find it hard to roll up results to an international total. Again, take the similar approach of expanding every currency denominated field in the data warehouse to be two fields: local and standard currency value. In this case, you also need to add a currency dimension to each fact table to unambiguously identify the local currency. The location of the transaction is not a reliable indicator of the local currency type.
|
Most Popular This Week
IE Weekly Newsletter
Subscribe to the newsletter
|
|
|




